or the current ability estimate. The study was conducted over the winter
semester and summer session of 1980 using as subjects volunteers from an
undergraduate introductory course in measurement and a graduate/undergraduate
course in group intelligence testing. Analyses for the two procedures
included  a  determination  of  the  optimal  test  length,  a  comparison  of  the 
test-retest  reliability,  a  comparison  of  the  total  test  information,  a 
comparison  of  the  obtained  ability  estimates,  a  comparison  of  the  goodness 
of  fit  of  the  3PL  model  to  the  test  data,  and  the  compiling  of  descriptive 
statistics  including  average  testing  time  and  average  test  difficulty. 

Results  of  the  analyses  indicated  that  the  optimal  test  length  of  the 
Bayesian  procedure  was  14  items,  while  the  optimal  test  length  of  the 
maximum  likelihood  procedure  was  12  items.  No  difference  was  found  between 
the procedures in terms of the reliability of the ability estimates. The
Bayesian procedure yielded greater mean total test information than the
maximum  likelihood,  but  this  was  found  to  be  due  to  the  regression  of  the 
Bayesian  ability  estimates  to  the  mean  of  the  assumed  prior  distribution, 
where  more  information  was  available.  In  the  range  of  ability  where  there 
were  ability  estimates  for  both  procedures  there  was  no  difference  in  total 
test  information.  Further  analyses  showed  that  the  assumption  of  different 
priors  can  significantly  alter  the  ability  estimates  obtained  from  the 
Bayesian  tailored  test,  as  well  as  the  total  test  information  yielded  by 
the test and the optimal length of the test. The goodness of fit comparison
indicated that the Bayesian procedure yielded significantly poorer fit
of  the  3PL  model  to  the  data  than  did  the  maximum  likelihood  procedure. 

Based  on  the  results  of  these  analyses  it  was  concluded  that  for  large 
scale  tailored  testing  a  maximum  likelihood  tailored  testing  procedure 
with  item  selection  based  on  information  is  the  procedure  of  choice. 
A Comparison of a Bayesian and a Maximum
Likelihood  Tailored  Testing  Procedure 


Tailored testing may soon come into widespread use as an alternative to
paper-and-pencil tests. For example, the Armed Services plan to implement
tailored testing procedures in the near future. The possibility of
large-scale implementation of tailored testing has increased the need to
identify the optimal tailored testing procedures among the many that are
available.

When selecting a tailored testing procedure, a decision must be made
as to which of numerous available techniques should be used for the
component parts of the tailored testing procedure. For instance, one
requirement for tailored testing is the calibration of items. For the
calibration  of  items  a  number  of  models  are  available  (e.g.,  one-,  two-, 
and  three-parameter  logistic  models),  and  for  each  model  there  may  be 
a  number  of  calibration  programs  available  (e.g.,  the  ANCILLES,  LOGIST, 
and  OGIVIA  procedures  for  the  three-parameter  logistic  model).  Two 
other  important  components  of  tailored  testing  are  the  item  selection 
procedure  and  the  ability  estimation  procedure.  While  there  are  several 
ability  estimation  procedures  available,  the  most  common  procedures  are 
Owen's  Bayesian  and  maximum  likelihood  estimation  procedures.  For 
selecting  items  the  two  most  frequently  used  procedures  are  either  to 
select  items  to  maximize  information  at  a  given  ability  level  or  to 
select  items  to  minimize  the  posterior  variance  of  the  ability  estimates. 
While a number of studies have been done comparing various procedures
available  for  a  number  of  components,  little  has  been  done  to  directly 
compare  tailored  testing  procedures  employing  these  different  ability 
estimation  and  item  selection  procedures.  The  purpose  of  the  current 
study,  then,  is  to  compare  in  a  live  testing  setting  tailored  testing 
procedures  based  on  maximum  likelihood  ability  estimation  and  maximum 
information  item  selection,  and  on  Owen's  Bayesian  ability  estimation 
(Owen, 1975) and minimum posterior variance item selection. Before
proceeding with a presentation of the current study, however, previous
studies  investigating  different  procedures  for  tailored  testing  will  be 
discussed. 


Comparison  of  Latent  Trait  Models 

Several  studies  have  been  done  to  compare  the  use  of  different  models 
for  tailored  testing.  One  such  study,  a  direct  comparison  of  the  1PL 
and  3PL  models  in  a  live  tailored  testing  setting,  was  reported  by  Koch 
and  Reckase  (1978).  The  purpose  of  this  study  was  to  compare  the  1PL 
and  3PL  models  in  a  tailored  testing  application  to  vocabulary  ability 
measurement.  Both  procedures  used  maximum  likelihood  techniques  for 
item  and  ability  parameter  estimation.  In  both  procedures  items  were 
selected to maximize the information function at the current ability
estimate. The results of this study indicated that both models could be
successfully applied to vocabulary ability measurement. The reliabilities
reported (a combination of test-retest and equivalent forms reliabilities)
indicated that the 3PL procedure yielded a slightly higher reliability
than the 1PL procedure (r = .77 for the 3PL procedure and r = .61 for the
1PL procedure). One important finding of this study was that, if careful
attention is not paid to the operational characteristics of the procedures,
nonconvergence of the maximum likelihood ability estimation
procedure can be a serious problem. In this study the 3PL procedure
failed to converge to ability estimates in nearly one-third of the
cases. Nonconvergence was not a serious problem with the 1PL procedure.


In  a  second  study,  reported  by  Koch  and  Reckase  (1979),  in  which 
the  1PL  and  3PL  models  were  applied  to  a  multidimensional  achievement 
test,  nonconvergence  of  the  3PL  maximum  likelihood  ability  estimation 
procedure  was  encountered  in  about  eight  percent  of  the  cases.  The 
substantial  reduction  in  nonconvergence  cases  over  the  previous  study 
was attributed to use of an item pool of more appropriate difficulty.
Despite the reduction of the number of cases of nonconvergence in this
study, the results still indicated a number of problem areas. Reliabilities
were quite low for both procedures, as was the information
yielded by both procedures. A number of possible explanations were
suggested for the inadequate performance of the procedures. Among
these were unstable item parameter estimates due to small sample sizes,
instability  due  to  poor  linking  procedures,  and  poor  selection  of 
entry  points  into  the  item  pool.  These  problems  appeared  to  have 
equally  serious  effects  on  both  the  1PL  and  3PL  procedures. 

A study reported by McKinley and Reckase (1980a) attempted to
correct the problems encountered in the Koch and Reckase studies.
Close attention was paid to appropriate item parameter linking and
entry points for the 1PL and 3PL procedures. The results of this
study indicated that both models could be quite successfully applied
to tailored testing if correctly implemented. Both 1PL and 3PL
reliabilities were higher than the reliability of a classroom test
over the same material. A comparison of the 1PL and 3PL procedures
indicated that the 3PL procedure yielded more information than the
1PL  procedure  or  the  classroom  test.  The  3PL  procedure  also  fit  the 
data  better  than  the  1PL  model.  This  study  concluded  that  for  tailored 
testing  applications  the  3PL  model  was  the  model  of  choice. 

A  similar  conclusion  was  reached  by  Urry  (1970,  1977b).  Through 
a series of simulation studies, Urry found that tailored testing becomes
less effective when a model with insufficient parameters is used.
He  concluded  that  construct  validity  decreases  as  a  function  of  the 
degree  of  degeneracy  of  the  model,  and  the  1PL  model  was  particularly 
inappropriate  for  use  with  multiple-choice  items  because  it  did  not 
portray  multiple-choice  response  data  with  fidelity  (Urry,  1977b). 

Comparisons  of  Ability  Estimation  Procedures 

It  would  appear  to  be  clearly  established  in  the  literature  that 
for  tailored  testing  applications  the  3PL  model  is  more  appropriate 
than  the  1PL  model.  However,  very  little  appears  in  the  literature 
concerning the optimal procedures for ability estimation and item selection
to be used in the 3PL tailored testing procedure.

One study that did compare different ability estimation procedures
was  conducted  by  Maurelli  (1978).  In  this  study  a  comparison  was  made 
of  maximum  likelihood  and  Bayesian  ability  estimation  procedures  in  a 
simulated stradaptive testing application. The Bayesian ability estimation
procedure was a modification of the procedure proposed by Owen
(1975),  and  the  maximum  likelihood  estimation  procedure  was  similar 
to  the  one  proposed  by  Lord  (1975).  Modifications  included  altering 
item  selection  procedures  to  make  them  compatible  with  the  branching 
scheme  of  the  stradaptive  model.  The  variables  investigated  included 
the  ability  estimation  procedures,  test  lengths  (15,  30,  and  45  items), 
and  the  use  or  non-use  of  prior  information  to  determine  entry  level 
(variable  entry  point).  The  conclusions  reached  in  this  study  included 
the finding that the maximum likelihood procedure performed best overall
when quality of performance was measured in terms of bias (mean
error of estimate), linearity of the regression of $\hat{\theta}$ on $\theta$, average
information, and fidelity (correlation of $\hat{\theta}$ with $\theta$). The Bayesian
procedure showed acceptable performance only at the longest test length
when using prior information to determine the entry point. This procedure
was found to be most deficient in the lower third of the ability
scale. Maurelli also concluded that assuming a normal prior assures
a regression of the estimates towards the mean of that prior. Unfortunately,
since this study was conducted using the classic stradaptive
item selection procedure for both ability estimation procedures, no
comparison of item selection procedures was made.

From  a  review  of  the  literature  it  is  apparent  that  there  is  little 
evidence for determining whether maximum likelihood or Bayesian estimation
is better, in general, for any application. Virtually no attempt
has  been  made  to  directly  compare  the  two  procedures  in  a  live  tailored 
testing  application.  Nor  has  there  been  any  comparison  of  the  two  most 
common  item  selection  procedures  used  with  these  estimation  procedures. 
The purpose of the present study, then, is to compare, in a live tailored
testing application, maximum likelihood ability estimation and maximum
information item selection with minimum posterior variance item selection
and Bayesian ability estimation.

Method 


Item  Pool 


Both  the  Bayesian  and  the  maximum  likelihood  tailored  tests  used 
the  same  pool  of  137  items.  Items  used  for  this  study  were  selected 
from  the  first  and  third  subtests  of  the  School  and  College  Ability 
Tests  (SCAT),  forms  2A  and  3A.  Estimates  of  the  3PL  item  parameters 
were obtained from the Educational Testing Service (ETS). The distributions
of the item parameter estimates are shown in Figures 1-A,
1-B, and 1-C, and a summary of the descriptive statistics for these
distributions is presented in Table 1. As can be seen in Figure 1-A the
item discriminations (a-values) were fairly evenly distributed, with
most of the items having a-values greater than .75. The item difficulties
(b-values), shown in Figure 1-B, were approximately normally
distributed between -2.0 and +2.0, with a slight tail at the negative
end. The item guessing values (c-values) were very tightly clustered
around the mean of .14. In terms of these item parameter estimates the
item pool was of high quality. It came very close to satisfying the
requirements for tailored testing item pools set out by Urry (1977a).

Table 1

Summary of Descriptive Statistics of Item
Parameter Estimates for Tailored Testing Item Pool

Statistic       a_i       b_i       c_i
Mean           1.12      -.05       .14
Median         1.04       .06       .14
St. Dev.        .46      1.17       .05
Skewness        .27      -.22      1.61
Kurtosis       -.90      2.14      6.15
Minimum         .14     -4.25       .06
Maximum        1.94      4.56       .39

Note: The item pool contained 137 items.


Figure 2 shows the total information curve for the item pool. As
can be seen in the figure the curve is slightly negatively skewed. The
curve is very high near the center of the ability estimate range and
drops off rather sharply toward the extremes of the ability estimate
range. Information was high between -1.5 and +2.5, but outside that range
not much information was available.


Tailored Testing Procedures

In general, tailored testing procedures have three main components:
an item selection routine, an ability estimation procedure, and a stopping
rule. In this study two combinations of item selection and ability
estimation procedures were used. For one group of subjects, items that
had the most item information (Birnbaum, 1968) at the most recent ability
estimate were selected and a maximum likelihood estimation procedure
using all previous responses was used for ability estimates. For the
rest of the subjects items were selected to minimize the posterior variance
of the ability estimate distribution and Owen's Bayesian ability
estimation procedure was used for ability estimates. These procedures
will be described in greater detail shortly.

Before testing began no ability estimates were available for the subjects,
so initial estimates were assigned to determine the starting points
in the item pool. For both procedures the initial ability estimates were
randomly assigned to each subject to be either +.150 or -.100. These
values represent difficulty values near the center of the item pool
difficulty distribution with one starting point on either side of the median.
For the second session subjects were assigned the alternative initial
ability estimate in order to provide different initial items from one
session to the next. Both procedures used the same stopping rule. The
tailored tests continued until 20 items had been administered.

Ability  Estimation  and  Item  Selection  Procedures 

For the maximum likelihood tailored tests, items were selected for
administration that yielded the maximum item information at the most
recent ability estimate. For the 3PL model the formula for item
information is given by

$$I_i(\theta_j) = D^2 a_i^2 \psi[D L_i(\theta_j)] - D^2 a_i^2 P_i(\theta_j) \psi[D L_i(\theta_j) - \log c_i] \tag{1}$$

where $I_i(\theta_j)$ is the value of the item information at Ability $\theta_j$,
$L_i(\theta_j) = a_i(\theta_j - b_i)$, $P_i(\theta_j)$ is the probability of a correct
response to Item i given Ability $\theta_j$, and $\psi(x)$ is the logistic
probability density function. Total test information is the sum of the
item information values:

$$I(\theta_j) = \sum_{i=1}^{n} I_i(\theta_j). \tag{2}$$

Formula  1  was  used  in  the  tailored  testing  procedure  to  compute  the  infor¬ 
mation  for  each  item  in  the  item  pool  at  the  examinee's  current  ability 
estimate.  The  item  with  the  greatest  information  at  the  ability  estimate 
was  then  administered  to  the  examinee.  The  first  item  was  selected  to 
maximize  information  at  the  initial,  randomly  assigned  ability  estimate. 

If that item was correctly answered the ability estimate was increased
by a fixed stepsize of .4, and if it was incorrectly answered the ability
estimate was decreased by the fixed stepsize. The .4 stepsize was selected
on the basis of previous research as giving the best combination of
minimum error and least statistical bias (Patience and Reckase, 1980).
This fixed stepsize procedure was used until a maximum likelihood ability
estimate, the mode of the likelihood distribution, could be obtained
(i.e., when both correct and incorrect responses were obtained). Each
new item was then selected to maximize the information at the new ability
estimate, with the restriction that no item could be used more than once.
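
The maximum likelihood procedure described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the program used in the study: the scaling constant D = 1.7 is assumed (the text does not give its value), the function names are hypothetical, and a simple grid search stands in for whatever numerical method was actually used to locate the mode of the likelihood.

```python
import math

D = 1.7  # assumed scaling constant; not stated in the text


def p3pl(theta, a, b, c):
    """3PL probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def logistic_pdf(x):
    """Logistic density psi(x), written to avoid overflow for large |x|."""
    e = math.exp(-abs(x))
    return e / (1.0 + e) ** 2


def item_information(theta, a, b, c):
    """Item information in the two-term form of Equation 1."""
    x = D * a * (theta - b)
    p = p3pl(theta, a, b, c)
    return (D * a) ** 2 * (logistic_pdf(x) - p * logistic_pdf(x - math.log(c)))


def select_item(theta, pool, used):
    """Index of the unused item with the most information at theta."""
    candidates = [i for i in range(len(pool)) if i not in used]
    return max(candidates, key=lambda i: item_information(theta, *pool[i]))


def log_likelihood(theta, items, responses):
    """Log likelihood of a 0/1 response string under the 3PL model."""
    total = 0.0
    for (a, b, c), u in zip(items, responses):
        z = D * a * (theta - b)
        q = (1.0 - c) / (1.0 + math.exp(z))  # probability of an incorrect response
        total += math.log(1.0 - q) if u else math.log(q)
    return total


def update_estimate(theta, items, responses, step=0.4):
    """Fixed step until both response types occur, then the likelihood mode."""
    if all(responses) or not any(responses):
        return theta + step if responses[-1] else theta - step
    grid = [g / 100.0 for g in range(-400, 401)]  # hypothetical search range
    return max(grid, key=lambda t: log_likelihood(t, items, responses))
```

A test would start from the randomly assigned estimate (+.150 or -.100), call `select_item`, record the response, call `update_estimate`, and repeat until 20 items have been given.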


For the Bayesian tailored tests items were selected to minimize the
posterior variance of the ability estimate distribution. Owen's procedure
assumes a normal distribution of ability as a prior. In this study
the mean of that prior distribution was set equal to the initial, randomly
assigned ability estimate, and the standard deviation of the prior was
set equal to one. The first item was then selected so as to result in
the greatest possible reduction in the standard error of estimate (Urry,
1977a). This was accomplished in the following manner. Rather than
computing the actual value of the standard error of estimate, a statistic
labelled a by Jensema (1974) was used for efficiency of computation.
The expected standard error is a function of a. The value of this statistic
was calculated for each item in the item pool according to the formula


1U2-U.)  exp  (2Di2) 

a  =  - 3 - 1 - i -  (3) 

2(l-ci)ti 

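where $c_i$ is the item guessing value and the following relationships hold:

$$D_i = \frac{b_i - \theta_j}{[2(a_i^{-2} + \sigma^2)]^{1/2}}, \tag{4}$$

$$\operatorname{erf} D_i = \frac{2}{\sqrt{\pi}} \int_0^{D_i} \exp(-x^2)\,dx, \tag{5}$$

$$u_i = 1 - \operatorname{erf} D_i, \tag{6}$$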

(1 


ct,ui 


+ 


2 


(7) 
where

$$v_i = 1 + \operatorname{erf} D_i. \tag{14}$$

Once new estimates of ability and variance were calculated, new values of
the item selection statistic were computed for all unused items in the
item pool. The above sequence was then repeated. For further discussion
of the mechanics of this procedure see Owen (1975), Urry (1971),
Jensema (1972) or Jensema (1974).

One further point about this procedure that should be noted is that the
prior assumption of normality is maintained throughout the procedure.
That is, the distribution of ability is not recomputed after each item.
Rather, it is assumed to remain normal with a mean equal to the current
ability estimate and variance equal to the current estimate of variance.
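
To make the sequence concrete, the following Python sketch shows one way a normal-prior update of this kind can be written. It is not the program used in the study. It assumes a normal-ogive item response function with guessing, for which the posterior mean and variance after a single response have closed forms; the function names are hypothetical, and the selection rule shown (smallest expected posterior variance) is a stand-in for the computational shortcut described above.

```python
import math


def norm_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def owen_update(mu, var, a, b, c, correct):
    """Posterior mean and variance after one response, given a N(mu, var) prior."""
    s = math.sqrt(a ** -2 + var)
    d = (b - mu) / s
    shrink = var / (a ** -2 + var)
    if correct:
        p_correct = c + (1.0 - c) * norm_cdf(-d)
        r = (1.0 - c) * norm_pdf(d) / p_correct
        return mu + var * r / s, var * (1.0 - shrink * r * (r - d))
    r = norm_pdf(d) / norm_cdf(d)
    return mu - var * r / s, var * (1.0 - shrink * r * (r + d))


def expected_posterior_variance(mu, var, a, b, c):
    """Posterior variance averaged over the two possible responses."""
    s = math.sqrt(a ** -2 + var)
    p_correct = c + (1.0 - c) * norm_cdf(-(b - mu) / s)
    _, v_right = owen_update(mu, var, a, b, c, True)
    _, v_wrong = owen_update(mu, var, a, b, c, False)
    return p_correct * v_right + (1.0 - p_correct) * v_wrong


def select_item(mu, var, pool, used):
    """Index of the unused item expected to leave the smallest variance."""
    candidates = [i for i in range(len(pool)) if i not in used]
    return min(candidates,
               key=lambda i: expected_posterior_variance(mu, var, *pool[i]))
```

After each response the returned mean and variance become the parameters of the next normal prior, which is the sense in which the assumption of normality is maintained throughout the test.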

Design 

This  study  employed  a  test-retest  design,  with  two  sessions  one 
week  apart.  Subjects  were  randomly  assigned  to  take  either  a  maximum 
likelihood  or  a  Bayesian  tailored  test,  and  subjects  were  randomly 
assigned  an  initial  ability  estimate  of  -.100  or  +.150  for  the  first 
session.  For  the  second  session  subjects  were  assigned  to  the  alter¬ 
native  initial  ability  estimate  from  the  one  assigned  the  first  session. 
Subjects  received  the  same  type  of  test,  Bayesian  or  maximum  likelihood, 
for  both  sessions  in  order  to  make  test-retest  reliability  comparisons 
possible.  The  tailored  tests  were  administered  on  Applied  Digital  Data 
Systems  (ADDS)  Consul  980  cathode  ray  tube  terminals  connected  to  an 
Amdahl  470/V7  via  Time  Sharing  Option  facilities. 

Sample

This  study  was  conducted  over  the  winter  semester  and  summer  session 
of  1980.  For  the  rest  of  this  paper  both  will  be  referred  to  as  semesters. 
The winter semester study was conducted using 34 volunteers from an
introductory course in measurement. Of these 34 volunteers, 31 were female
and three were male, 33 were seniors and one was a graduate student. The
second semester subjects included volunteers from two courses, the
introductory course in measurement mentioned above, and a graduate/undergraduate
course in group intelligence testing. There were 36 volunteers, of
whom 14 were in the introductory course in measurement. The remaining
22 were from the group intelligence testing course. There were 25 females
in this second semester group, and 11 males. There were 14 graduate
students, 15 seniors, six juniors, and one sophomore.

Analyses 

Before  any  of  the  planned  analyses  were  performed,  preliminary  analyses 
were  performed  to  determine  whether  data  from  the  two  semesters  should  be 
combined. These analyses included the plotting and visual comparison of
the ability estimate distributions obtained for the two semesters and a
comparison of the ability estimate means from the two semesters using
analysis of variance (ANOVA) procedures. Because the second semester
study included students enrolled in graduate school, and because it occurred
during a summer semester, there was some reason to suspect the two groups
were not comparable and should not be combined.

Figures  3  and  4  show  the  ability  estimate  distributions  for  the 
Bayesian  tailored  tests  for  the  winter  and  summer  semesters,  respectively. 
Figures 5 and 6 show the maximum likelihood tailored test ability estimate
distributions for the winter and summer semesters, respectively.
A visual comparison of these plots indicated that the ability estimates
from the summer semester tended to be higher than the winter semester
ability estimates for both the Bayesian and the maximum likelihood tailored
tests. This indicated that the subjects in the summer semester may
have  had  higher  vocabulary  ability  than  the  winter  semester  subjects, 
since  both  groups  took  tests  using  the  same  item  pool.  The  means  of 
these  distributions  are  shown  in  Table  2. 


Figure 3. Ability estimate frequency distribution for Bayesian tailored tests for winter combined over sessions.

Figure 4. Ability estimate frequency distribution for Bayesian tailored tests for summer.

Figure 5. Ability estimate frequency distribution for maximum ... (axes: ability estimate, frequency).

Figure 6. Ability estimate frequency distribution for maximum likelihood tailored tests for summer combined over sessions.

Table 2

Mean Ability Estimates for Bayesian and Maximum
Likelihood Tailored Tests for Winter and Summer

                            Winter                   Summer
Test                  Session 1  Session 2    Session 1  Session 2
Bayesian                 .75        .86         1.18       1.27
Maximum Likelihood      1.25       1.30         1.53       1.56
In an attempt to confirm that there was a real difference between
the groups a three-way analysis of variance with repeated measures on
one factor was performed. The independent variables were test (Bayesian
or maximum likelihood), semester (winter or summer), and session. The
repeated measures were over the sessions. In order to facilitate the
interpretation of the results of this analysis the Bayesian and maximum
likelihood ability estimates were put on the same scale by converting
them to Z-scores. The Z-scores were computed for each procedure using
the within-procedure means and standard deviations. This put both sets
of ability estimates on the same scale, thus eliminating any differences
due to different scales. This was done because at this stage the
differences in the procedures were not an issue. The ANOVA was then run
using the Z-scores as the dependent variable. The results of the ANOVA
are shown in Table 3.
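
The rescaling step can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: the center of 50 follows a later remark that the means for the two tests were set to 50, the spread of 10 is assumed because the text does not give it, the population form of the standard deviation is an arbitrary choice, and the ability estimates shown are made-up values, not data from the study.

```python
from statistics import mean, pstdev


def standardize(estimates, center=50.0, spread=10.0):
    """Rescale one procedure's estimates using its own mean and SD."""
    m = mean(estimates)
    s = pstdev(estimates)  # population SD; the sample SD would also be defensible
    return [center + spread * (x - m) / s for x in estimates]


# Hypothetical ability estimates for each procedure
bayes_scaled = standardize([0.4, 0.9, 1.3, 0.7])
ml_scaled = standardize([0.8, 1.6, 2.1, 1.1])
```

Because each procedure is standardized against its own mean, the two sets of scores share a common center, which removes any overall difference between the procedures from the analysis.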


Table 3

Analysis of Variance Table
for Preliminary Comparison of Winter
and Summer Ability Estimate Distributions

Source                            SS    df         MS      F      p
Test                           35.96     1      35.96   0.20   .655
Semester                     1406.51     1    1406.51   7.89   .006
Test x Semester                35.96     1      35.96   0.20   .655
Error                       11942.65    67     178.25
Session                        50.87     1      50.87   6.63   .012
Test x Session                  9.67     1       9.67   1.26   .265
Semester x Session              2.03     1       2.03   0.26   .608
Test x Semester x Session       0.10     1       0.10   0.01   .910
Error                         513.97    67       7.67
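
As a check on how the F ratios in Table 3 relate to the sums of squares, the short Python sketch below recomputes two of them. The helper name is hypothetical. It assumes the usual repeated measures layout, in which the between-subjects effects (test, semester) are tested against the first error term and the within-subjects effects (session and its interactions) against the second.

```python
def f_ratio(ss_effect, df_effect, ss_error, df_error):
    """F = mean square for the effect / mean square for its error term."""
    return (ss_effect / df_effect) / (ss_error / df_error)


# Semester is a between-subjects effect: first error term
f_semester = f_ratio(1406.51, 1, 11942.65, 67)

# Session is a within-subjects effect: second error term
f_session = f_ratio(50.87, 1, 513.97, 67)

print(round(f_semester, 2), round(f_session, 2))  # 7.89 6.63
```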
Because Z-scores were used the presence of different scales for
the two test procedures did not result in a problem in interpretation.
Also, because the means for the two tests were set to 50 the test main
effect was eliminated. The semester main effect, however, was significant
(F = 7.89, p < .01), indicating that the examinees in the summer study
had significantly higher vocabulary ability estimates. Because of this
the decision was made not to combine the two groups, but rather to treat
semester as an independent variable with two levels.

Once  the  determination  was  made  that  the  data  from  the  two  semesters 
should  not  be  combined,  a  number  of  analyses  were  performed  separately 
on the two sets of data. The first analysis was the determination of
optimal test lengths for the two procedures. This analysis was performed
since the administration of inappropriate items may induce ability
estimate bias. As the items appropriate for an ability are
used up, bias can be introduced into the ability estimates if the test
is continued, since the procedure may begin to administer less appropriate
items (Reckase, 1974). Therefore, it is important that the tailored tests
do not continue beyond the optimal length. The test length analysis was
accomplished  by  plotting  the  convergence  of  the  procedures  to  ability 
estimates  for  each  tailored  test.  That  is,  for  a  given  tailored  test  the 
ability  estimate  obtained  after  each  item  was  plotted  against  the  item 
number.  Then  a  second  plot  was  done  as  an  overlay  on  the  same  set  of 
axes.  For  the  maximum  likelihood  tests  the  overlay  was  the  plot  of  the 
item  information  that  was  obtained  for  each  item  at  the  previous  ability 
estimate  against  the  item  number.  For  the  Bayesian  tests  the  overlay  was 
the  plot  of  the  standard  error  of  estimate  obtained  after  each  item  against 
the  item  number.  The  purpose  of  these  plots  was  to  graphically  represent 
the interrelationships of test length, ability estimate, and item information
or standard error of estimate, so that a determination could be made
as to what test length and item information or standard error of estimate
would be optimal as cutoff values for terminating the tailored tests.

Other  analyses  performed  included  comparisons  of  the  Bayesian  and 
maximum  likelihood  test-retest  reliabilities,  the  total  test  information 
yielded  by  the  two  procedures,  and  the  ability  estimates  yielded  by  the 
two  procedures.  All  of  these  comparisons  were  made  using  the  20  item 
level  as  well  as  at  the  various  test  lengths  determined  by  the  optimal 
cutoff  analyses.  All  correlations  used  in  the  reliability  analyses  were 
computed  using  both  ability  estimates  and  estimated  true  scores  (Lord, 
1979).  The  computation  of  the  estimated  true  scores  was  accomplished 
by  summing  the  probabilities  of  correct  responses  at  the  examinee's  final 
ability estimate for all the items in the item pool. The formula for
estimated true scores is as follows:

$$t(\theta_j) = \sum_{i=1}^{n} P_i(\theta_j), \tag{15}$$

where $t(\theta_j)$ is the estimated true score for Examinee j.
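
The estimated true score computation can be sketched in Python as follows. This is an illustration only: the scaling constant D = 1.7 is assumed, the function names are hypothetical, and the three-item pool is invented for the example rather than taken from the study's 137-item pool.

```python
import math

D = 1.7  # assumed scaling constant; not stated in the text


def p3pl(theta, a, b, c):
    """3PL probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def estimated_true_score(theta, pool):
    """Sum of the probabilities of a correct response over all items in the pool."""
    return sum(p3pl(theta, a, b, c) for a, b, c in pool)


# Hypothetical (a, b, c) item parameters
pool = [(1.0, -1.0, 0.14), (1.2, 0.0, 0.14), (0.8, 1.0, 0.14)]
low = estimated_true_score(-3.0, pool)
high = estimated_true_score(3.0, pool)
```

The score is expressed on the number-correct metric of the whole pool, so it is bounded below by the sum of the guessing values and above by the number of items.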
The reliabilities computed for this study were not strictly test-retest
reliabilities, but rather a mixture of test-retest and equivalent
forms  reliabilities,  since  the  tests  in  one  session  were  not  identical 
to  tests  taken  in  the  other  session.  The  hypothesis  that  all  of  the 
reliabilities  were  estimates  of  the  same  reliability  was  tested  using 
a  chi-square  test  given  in  Snedecor  and  Cochran  (1980). 
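
A test of this kind is commonly carried out with Fisher's z transformation, and the Python sketch below shows that version. Whether it matches the exact computation used in the study is an assumption, the function name is hypothetical, and the correlations and sample sizes in the usage line are invented for illustration.

```python
import math


def homogeneity_chi_square(correlations, sample_sizes):
    """Chi-square statistic for the hypothesis that k correlations are equal.

    Each r is transformed to z = atanh(r) and weighted by n - 3; the statistic
    is referred to a chi-square distribution with k - 1 degrees of freedom.
    """
    z = [math.atanh(r) for r in correlations]
    w = [n - 3 for n in sample_sizes]
    z_bar = sum(wi * zi for wi, zi in zip(w, z)) / sum(w)
    chi2 = sum(wi * (zi - z_bar) ** 2 for wi, zi in zip(w, z))
    return chi2, len(z) - 1


# Hypothetical reliabilities from four groups
chi2, df = homogeneity_chi_square([0.81, 0.77, 0.85, 0.79], [17, 17, 18, 18])
```

A small statistic relative to the chi-square critical value for the given degrees of freedom is consistent with all of the reliabilities estimating a single common value.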

The  total  test  information  analyses  were  done  to  compare  the  amount 
of information yielded at the final ability estimate by the two procedures.
Total test information was computed using Equations 1 and 2,
where  the  summation  in  Equation  2  is  over  the  items  of  each  tailored  test. 
Total  test  informations  were  compared  using  both  plots  and  analysis  of 
variance  procedures. 

Comparisons of the ability estimates included a number of analyses.
One analysis was the comparison of the distributions of ability estimates
yielded by the two procedures using plots of the distributions. Also,
analysis of variance procedures were used to compare the mean ability
estimates. Another comparison involved the use of the Bayesian and maximum
likelihood ability estimation procedures with the item selection
procedures switched. That is, Bayesian ability estimates were obtained
for  the  items  selected  by  the  maximum  likelihood  tailored  test  procedure, 
and  maximum  likelihood  ability  estimates  were  obtained  for  the  items 
selected  by  the  Bayesian  tailored  test  procedure.  The  purpose  of  this 
analysis  was  to  determine  whether  the  differences  found  between  the  two 
procedures  were  due  solely  to  the  ability  estimation  procedures,  or  whether 
the  item  selection  procedures  also  had  an  effect. 

Another  set  of  analyses  performed  was  the  comparison  of  the  items 
that  were  administered  by  the  tailored  tests.  Included  in  these  analyses 
were  a  comparison  of  the  items  administered  by  the  two  procedures  and 
a  comparison  for  each  procedure  of  the  items  administered  for  the  two 
sessions. 

The goodness of fit of the 3PL model to the test data for the two
procedures was also compared. The goodness of fit statistic
used in this study was the mean square deviation (MSD), calculated for
each examinee by summing over the administered items the squared
differences between the actual responses to the items and the expected
responses to the items (probability of a correct response) as predicted
by the model, and dividing by the number of items (Reckase, 1977). The
formula for the MSD statistic is

$$ MSD_j = \frac{\sum_{i=1}^{n_j} \left[ u_{ij} - P_i(\theta_j) \right]^2}{n_j} \tag{16} $$

where MSD_j is the mean squared deviation for Examinee j, u_ij is the actual
response to Item i by Examinee j, P_i(θ_j) is the probability of a correct
response to Item i by Examinee j determined from the model using the final
ability estimate and the item parameter estimates, and n_j is the number
of items in the tailored test for Examinee j. The goodness of fit
of the two procedures was compared using the MSD statistic as the
dependent variable in a dependent t-test.
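The MSD statistic for a single examinee reduces to a short computation. The sketch below assumes the model probabilities have already been evaluated at the examinee's final ability estimate; the function name and the example responses and probabilities are hypothetical.

```python
def mean_squared_deviation(responses, probabilities):
    """MSD for one examinee: mean of (u_ij - P_i(theta_j))^2 over administered items."""
    if not responses or len(responses) != len(probabilities):
        raise ValueError("need one model probability per administered item")
    squared = [(u - p) ** 2 for u, p in zip(responses, probabilities)]
    return sum(squared) / len(squared)


# Hypothetical four-item tailored test: scored responses and model probabilities.
u = [1, 0, 1, 1]
p = [0.8, 0.3, 0.6, 0.5]
msd = mean_squared_deviation(u, p)  # (0.04 + 0.09 + 0.16 + 0.25) / 4 = 0.135
```

Smaller values indicate closer agreement between observed responses and model predictions, so a procedure with a larger mean MSD shows poorer fit.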

Other analyses run on the data included two correlational analyses.
One such analysis performed was the correlation of item response latency
times with the ability estimates. Correlations were obtained between
mean item response latency times and final ability estimates. Also the
mean item response latency times for correct responses and incorrect
responses were compared using an analysis of variance procedure.

A final set of analyses performed was the compilation of descriptive
statistics for the two procedures for both sessions. Descriptive
statistics included average testing time and average test difficulty.

Results


Optimal  Cutoffs 

Figure 7 shows typical convergence plots that were obtained for one
person using the winter data for the maximum likelihood tailored tests
and Figure 8 shows typical convergence plots obtained for one person using
the winter data for the Bayesian tailored tests. The values of the ability
estimates and the item information estimates at the estimated ability
that were plotted in Figure 8 are shown in Table 5. These figures and
tables and others like these were examined in order to determine the
minimum test length at which the ability estimates obtained from the two
procedures appeared to be stable. In the plots of the ability estimates
obtained from the maximum likelihood tailored test procedure the curve
appeared to flatten out at about 12 items, indicating that 12 items was a
sufficient length for the tailored tests. For the Bayesian procedure
the curves flattened out around the 14 item level. In terms of item
information the 12 item cutoff for the maximum likelihood procedure would
represent an information cutoff of approximately 1.64. That is, the
average item information for Item 12, using the 12 item ability estimates,
was 1.64 (n=18). The Bayesian cutoff of 14 items would represent a
standard error of estimate cutoff of .25, which was the average standard
error of estimate at the 14 item level (n=16).


[Plots not reproduced; axis label: THETA/INFORMATION]

FIGURE 7

ABILITY ESTIMATES AND INFORMATION VALUES AFTER EACH
ITEM IN A MAXIMUM LIKELIHOOD TAILORED TEST FOR THE
WINTER SEMESTER

FIGURE 8

ABILITY ESTIMATES AND STD ERRORS OF ESTIMATE AFTER
EACH ITEM IN A BAYESIAN TAILORED TEST FOR THE WINTER
SEMESTER

Table 4

Ability Estimates and Item Information for
Both Sessions of a Maximum Likelihood Tailored Test
for the Winter Semester

              Session 1                    Session 2
Item   Ability Estimate  Information  Ability Estimate  Information
  0        -.100           2.721          .150            2.850
  1         .300           2.884          .550            2.994
  2         .700           2.952          .950            3.003
  3         .487           2.943          .701            2.952
  4         .654           2.752          .847            2.705
  5         .548           2.607          .964            2.584
  6         .654           2.413          .817            2.637
  7         .692           2.371          .855            2.259
  8         .768           2.361          .708            2.363
  9         .690           2.301          .786            2.224
 10         .614           2.288          .639            2.276
 11         .621           2.225          .677            2.088
 12         .699           2.055          .571            1.969
 13         .553           1.966          .619            1.880
 14         .560           1.964          .626            1.838
 15         .598           1.651          .664            1.814
 16         .635           1.846          .712            1.887
 17         .673           1.793          .664            1.795
 18         .721           1.911          .712            1.796
 19         .759           1.826          .760            1.657

Table 5

Ability Estimates and Standard Errors of Estimate
for Both Sessions of a Bayesian Tailored Test
for the Winter Semester

              Session 1                       Session 2
Item   Ability    Standard Error      Ability    Standard Error
       Estimate   of Estimate         Estimate   of Estimate
  0     -.100       1.000               .150       1.000
  1      .579        .818               .872        .790
  2     1.043        .660              1.216        .650
  3      .551        .502               .660        .485
  4      .760        .440               .827        .435
  5      .884        .401               .964        .399
  6      .995        .371              1.113        .371
  7     1.124        .348              1.270        .348
  8     1.264        .327              1.155        .316
  9     1.161        .300              1.019        .292
 10     1.037        .279              1.065        .281
 11     1.078        .268              1.119        .272
 12     1.126        .261              1.173        .263
 13     1.177        .253              1.214        .257
 14     1.214        .247              1.146        .243
 15     1.275        .243              1.168        .239
 16     1.176        .232              1.082        .228
 17     1.101        .223              1.102        .224
 18     1.121        .219              1.037        .217
 19     1.138        .217              1.057        .212
 20     1.165        .212              1.086        .209

FIGURE 9

ABILITY ESTIMATES AND INFORMATION VALUES AFTER EACH
ITEM IN A MAXIMUM LIKELIHOOD TAILORED TEST FOR THE
SUMMER SEMESTER

FIGURE 10

ABILITY ESTIMATES AND STD ERRORS OF ESTIMATE AFTER EACH
ITEM IN A BAYESIAN TAILORED TEST FOR THE SUMMER SEMESTER

Figure 9 shows typical convergence plots for the maximum likelihood
tailored tests for the summer semester and Figure 10 shows typical
Bayesian convergence plots for the summer semester. Table 6 shows the
ability estimates and item information estimates at the estimated ability
used for Figure 9 and Table 7 shows the ability estimates and standard
errors of estimate used in Figure 10. A visual examination of these
plots indicates that the optimal cutoffs for the summer semester data
were roughly the same as those determined for the winter data.

The  mean  item  informations  and  mean  standard  errors  of  estimate  at 
the  12  and  14  item  levels,  respectively,  were  not  significantly  different 
from  the  values  obtained  for  the  winter  semester  data.  Because  of  this, 
all  of  the  analyses  were  performed  using  ability  estimates  based  on 
tailored  tests  that  were  12  items  long,  14  items  long,  and  20  items  long, 
regardless  of  the  semester  or  whether  the  testing  procedure  was  Bayesian 
or  maximum  likelihood.  In  this  way  the  Bayesian  and  maximum  likelihood 
tailored  test  procedures  could  be  compared  at  their  respective  optimal 
cutoffs  and  at  various  equal  test  lengths. 

Table 6

Ability Estimates and Item Information for Both
Sessions of a Maximum Likelihood Tailored Test
for the Summer Semester

              Session 1                    Session 2
Item   Ability Estimate  Information  Ability Estimate  Information
  0         .150           2.850         -.100            2.721
  1         .550           2.994          .300            2.884
  2         .950           3.003          .700            2.952
  3        1.350           2.942         1.100            2.922
  4        1.750           2.898         1.500            2.990
  5        2.150           2.794         1.900            2.853
  6        1.772           2.383         1.708            2.571
  7        1.626           2.077         1.783            2.396
  8        1.701           1.869         1.637            2.072
  9        1.749           1.800         1.470            2.285
 10        1.756           1.541         1.518            2.049
 11        1.804           1.452         1.565            1.764
 12        1.658           1.473         1.398            1.823
 13        1.465           1.684         1.406            1.742
 14        1.473           1.644         1.259            1.812
 15        1.356           1.804         1.143            2.002
 16        1.363           1.708         1.190            1.889
 17        1.411           1.542         1.198            1.857
 18        1.404           1.464         1.205            1.750
 19        1.411           1.447         1.212            1.713

Table 7

Ability Estimates and Standard Errors of Estimate for
Both Sessions of a Bayesian Tailored Test
for the Summer Semester

              Session 1                       Session 2
Item   Ability    Standard Error      Ability    Standard Error
       Estimate   of Estimate         Estimate   of Estimate
  0      .150       1.000              -.100       1.000
  1     -.429        .741               .579        .818
  2     -.036        .660              1.043        .660
  3      .358        .569               .551        .502
  4      .694        .491               .760        .440
  5      .857        .438               .884        .401
  6      .630        .375               .995        .371
  7      .761        .352               .824        .332
  8      .561        .316               .632        .298
  9      .632        .302               .711        .286
 10      .713        .290               .760        .276
 11      .770        .279               .809        .268
 12      .694        .263               .874        .259
 13      .743        .255               .824        .247
 14      .805        .249               .860        .241
 15      .742        .239               .797        .232
 16      .765        .232               .827        .226
 17      .796        .228               .872        .221
 18      .763        .221               .930        .219
 19      .787        .217               .878        .212
 20      .830        .212               .911        .208

Reliabilities 


Table 8 shows the test-retest reliabilities that were obtained for
this study. It includes reliabilities at the 12, 14, and 20 item levels,
for both the Bayesian and maximum likelihood tailored tests, for both the
winter and summer session data. The reliabilities in Table 8 were
computed using both ability estimates and estimated true scores. Fisher's
r to z transformation was applied, and then a chi-square test (Snedecor
and Cochran, 1980) was performed to determine whether all the reliabilities
were estimates of the same reliability. The obtained chi-square
statistic was found to be not significant. Thus, the 12 item test length was
not significantly different from the 20 item length in terms of reliability.
Moreover, based on these results it would appear that there were no
significant differences between the reliabilities of the maximum likelihood
and Bayesian procedures, regardless of test length. That is, the reliability
at a test length of 12 items was approximately the same for the Bayesian
tailored tests as it was for the maximum likelihood tailored tests, even
though 12 items was selected as the optimal cutoff for the maximum
likelihood tailored tests. Also, although the 12 and 14 item cutoffs were
determined using the winter semester data, the reliabilities obtained at
those test lengths for the summer data were about the same. However, it
should be remembered that these reliabilities were obtained using small
sample sizes, so a large difference was needed for significance.
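The chi-square test on transformed correlations can be sketched as follows. The sketch assumes the usual form of the test for independent correlations: each r is converted with Fisher's z, weighted by n - 3, and the weighted squared deviations from the pooled z are summed, giving a statistic with k - 1 degrees of freedom. The function name is hypothetical, and the inputs are only the four 20 item ability-estimate reliabilities and sample sizes from Table 8, so the value is not the statistic reported for the full set of reliabilities.

```python
import math


def homogeneity_chi_square(correlations, sample_sizes):
    """Chi-square test that several independent correlations share one population value."""
    z = [math.atanh(r) for r in correlations]   # Fisher's r to z transformation
    w = [n - 3 for n in sample_sizes]           # weight = 1 / var(z) = n - 3
    z_bar = sum(wi * zi for wi, zi in zip(w, z)) / sum(w)
    chi_sq = sum(wi * (zi - z_bar) ** 2 for wi, zi in zip(w, z))
    return chi_sq, len(correlations) - 1, math.tanh(z_bar)


# 20 item ability-estimate reliabilities from Table 8 with their sample sizes:
# Bayesian winter, Bayesian summer, maximum likelihood winter, maximum likelihood summer.
r_values = [0.914, 0.963, 0.925, 0.908]
n_values = [16, 13, 18, 23]
chi_sq, df, pooled_r = homogeneity_chi_square(r_values, n_values)
```

With groups this small the weights n - 3 are low, so even sizable differences between correlations produce a modest chi-square; this is the sense in which a large difference was needed for significance.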

Table 8

Bayesian and Maximum Likelihood Tailored Test Reliabilities
for Winter and Summer Using Ability Estimates and
Estimated True Scores

                                  Winter                       Summer
Test        Estimate     20 Item  14 Item  12 Item    20 Item  14 Item  12 Item
Bayesian    Ability        .914     .919     .866       .963     .929     .905
            True Score     .885     .900     .830       .946     .881     .855
Max. Like.  Ability        .925     .865     .943       .908     .748     .777
            True Score     .899     .820     .936       .921     .875     .839

Note. Sample sizes for computation of reliabilities were n=16 for the winter
semester Bayesian reliabilities, n=13 for the summer semester Bayesian
reliabilities, n=18 for the winter semester maximum likelihood
reliabilities, and n=23 for the summer semester maximum likelihood
reliabilities.

Total  Test  Information 


The mean total test information at the ability estimates obtained for
the Bayesian and maximum likelihood tailored tests at the 12, 14, and 20
item test lengths for both the winter and summer semesters are shown in
Table 9. It was expected that the mean total test informations for both
the Bayesian and the maximum likelihood tests would be greater for the
winter semester than for the summer semester. This was expected because
it had already been determined that the ability estimates for the summer
semester were significantly higher than the ability estimates for the
winter semester. This would have resulted in items with greater b-values
being selected during the summer tests. Since fewer items were available
farther away from the center of the item pool, the total information in
that region of the pool would be lower. In addition, the fewer items
available would result in greater mismatching of ability estimates and
item difficulty, which would also lower total test information. In order
to confirm this, a three-way ANOVA was run using the 20 item mean total
test information as the dependent measure, with semester, session, and
test type (Bayesian or maximum likelihood) as independent variables. The
session variable was a repeated measure. The results of this ANOVA are
summarized in Table 10. As indicated in Table 10, an F=6.11 (p<.05) was
obtained for the semester main effect, indicating that the mean total
test informations for the winter semester were higher than the mean
total test informations for the summer. Thus, the results of the ANOVA
on the mean total test informations are consistent with the prediction
based on the finding that the ability estimates were higher for the
summer than for the winter.


Table 9

Mean Total Test Information for Bayesian and
Maximum Likelihood Tailored Tests for Winter and Summer

                              Bayesian                 Maximum Likelihood
Semester  Session     20 Item  14 Item  12 Item    20 Item  14 Item  12 Item
Winter    1            40.89    30.83    26.62      38.20    27.89    24.64
          2            41.33    31.61    27.61      36.98    27.60    23.98
          combined     41.11                        37.59
Summer    1            38.00    29.35    26.13      33.95    25.84    22.56
          2            37.49    29.09    25.67      33.29    24.79    21.62
          combined     37.74                        33.62

Table 10

Results of Three-Way ANOVA on 20 Item Mean Total Test
Informations Using Semester and Test as Independent
Variables with Repeated Measures over Sessions

Source                           SS      df      MS       F       p
Test                           494.71     1    494.71    6.63   0.012
Semester                       455.64     1    455.64    6.11   0.016
Test x Semester                  3.14     1      3.14    0.04   0.838
Error                         4997.47    67     74.59
Session                          8.04     1      8.04    2.22   0.141
Session x Test                   6.90     1      6.90    1.90   0.172
Session x Semester               0.33     1      0.33    0.09   0.764
Session x Test x Semester        4.77     1      4.77    1.32   0.255
Error                          243.05    67      3.63
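
As a quick arithmetic check on Table 10, each F ratio is the effect mean square divided by the error mean square of its own stratum (between-subjects effects against the first error term, session effects against the second). The short sketch below recomputes three of the tabled values from the reported sums of squares; the function name is illustrative only.

```python
def f_ratio(ms_effect, ss_error, df_error):
    """F = effect mean square / error mean square for the matching error stratum."""
    return ms_effect / (ss_error / df_error)


# Between-subjects stratum: error SS = 4997.47 on 67 df (Table 10).
f_test = f_ratio(494.71, 4997.47, 67)      # rounds to 6.63
f_semester = f_ratio(455.64, 4997.47, 67)  # rounds to 6.11

# Within-subjects (session) stratum: error SS = 243.05 on 67 df (Table 10).
f_session = f_ratio(8.04, 243.05, 67)      # rounds to 2.22

print(round(f_test, 2), round(f_semester, 2), round(f_session, 2))
```

The much smaller within-subjects error term reflects the repeated measurement of the same examinees across sessions.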

Table 9 shows that the mean total test informations from the Bayesian
tests were higher than the mean total test informations from the maximum
likelihood tests at all test lengths. The ANOVA summarized in Table 10
indicates that this difference is significant, since an F=6.63 (p<.05)
was obtained for the test main effect. This result may indicate that the
mean Bayesian ability estimate was less than the mean maximum likelihood
ability estimate since, as was previously pointed out, the information
per item was lower for items farther away from the center of the item
pool. In order to further compare the total test information yielded by
the two procedures, the total test informations for the two procedures
were plotted on the same set of axes. These plots are shown in Figure 11.


FIGURE  11 

TOTAL  TEST  INFORMATION 
FOR  THE  BAYESIAN  AND  MAXIMUM 
LIKELIHOOD  TAILORED  TESTS 
COMBINED  OVER  SEMESTER  AND 
SESSIONS 

As can be seen in Figure 11, the test information functions for the
Bayesian and maximum likelihood procedures were, for all practical
purposes, the same. The Bayesian curve is shifted toward the lower end of
the ability scale relative to the maximum likelihood curve, however,
indicating that the Bayesian ability estimates fall in a slightly lower
region of the scale. This result will be amplified in the next section.

Ability  Estimates 

A summary of the descriptive statistics for the ability estimate
distributions for the Bayesian tailored tests at the 12, 14, and 20 item
test lengths for both sessions of the winter semester study is shown in
Table 11. Table 12 contains the same data for the summer study. The
summary statistics for the maximum likelihood tailored test ability
estimate distributions at the 12, 14, and 20 item test lengths for the
winter and summer semesters are shown in Tables 13 and 14, respectively.
Plots of the 20 item ability estimate distributions were shown earlier
in Figures 3 through 6. An ANOVA previously discussed in conjunction
with these figures indicated that the summer semester ability estimates
were significantly higher than the winter semester ability estimates.

A comparison of the means presented in Tables 11 through 14 also indicates
that the maximum likelihood ability estimates were higher than the
Bayesian ability estimates. In order to confirm this fact, a four-way
ANOVA was run on the ability estimates using semester, test, session and
test length as independent variables, with repeated measures over sessions
and test lengths. Recall that the plot of total test information
indicated that there might be a significant difference in the two sets of
ability estimates.


Table 11

Descriptive Statistics for the Bayesian
Ability Estimate Distributions for the Winter Semester

                     Session 1                     Session 2
Statistic    20 Item  14 Item  12 Item     20 Item  14 Item  12 Item
N               16       16       16          16       16       16
Mean           .746     .650     .643        .859     .800     .833
St. Dev.       .452     .455     .503        .419     .401     .394
Skewness       .969     .942     .616       1.021    1.084    1.004
Kurtosis       .416     .916    -.147       -.135     .293     .030
Minimum        .181     .003    -.141        .497     .391     .388
Maximum       1.818    1.764    1.723       1.775    1.740    1.693

Table 12

Descriptive Statistics for the Bayesian
Ability Estimate Distributions for the Summer Semester

                     Session 1                     Session 2
Statistic    20 Item  14 Item  12 Item     20 Item  14 Item  12 Item
N               13       13       13          13       13       13
Mean          1.183    1.115    1.070       1.178    1.178    1.156
St. Dev.       .554     .507     .477        .584     .584     .597
Skewness       .037     .230     .466        .425     .425     .109
Kurtosis      -.933    -.643    -.272       -.216    -.216    -.263
Minimum        .387     .316     .330        .191     .191     .071
Maximum       2.049    1.967    1.960       2.256    2.256    2.183

Table 13

Descriptive Statistics for the Maximum Likelihood
Ability Estimate Distributions for the Winter Semester

                     Session 1                     Session 2
Statistic    20 Item  14 Item  12 Item     20 Item  14 Item  12 Item
N               18       18       18          18       18       18
Mean          1.255    1.280    1.314       1.296    1.312    1.336
St. Dev.       .332     .427     .408        .378     .435     .470
Skewness      -.184    -.227     .037       -.029     .003     .243
Kurtosis      -.349    -.969   -1.190      -1.035   -1.191    -.940
Minimum        .649     .539     .699        .750     .626     .571
Maximum       1.913    1.849    2.011       1.954    2.052    2.112

Table 14

Descriptive Statistics for the Maximum Likelihood
Ability Estimate Distributions for the Summer Semester

                     Session 1                     Session 2
Statistic    20 Item  14 Item  12 Item     20 Item  14 Item  12 Item
N               23       23       23          23       23       23
Mean          1.535    1.517    1.545       1.563    1.698    1.701
St. Dev.       .511     .554     .525        .580     .952     .789
Skewness      -.149    -.103    -.166        .283    3.054    2.610
Kurtosis       .343     .070   -4.26 0      -.075   12.050    9.398
Minimum        .457     .483     .665        .532     .576     .727
Maximum       2.634    2.592    2.537       2.887    5.500    4.700

The results of the four-way ANOVA on ability estimates are summarized
in Table 15. As can be seen in Table 15, the test main effect was
significant (F=15.43, p<.01), indicating that the maximum likelihood ability
estimates were significantly higher than the Bayesian ability estimates.
Thus, the hypothesis formulated on the basis of the information analyses
was confirmed. The significance of the semester main effect (F=8.33,
p<.01) is further evidence supporting the conclusion that the summer study
ability estimates were significantly higher than the winter study ability
estimates. This was true for both procedures, as indicated by the
nonsignificance of the semester x test interaction.

The significance of the session main effect (F=7.50, p<.01) indicates
that the second session ability estimates were significantly higher than
the first session ability estimates. The lack of significance of the
interaction of session with test indicates that the second session ability
estimates were significantly higher than the first session ability
estimates for both procedures.

Table 15

Results of Four-Way ANOVA on Ability Estimates
Using Semester, Test, Session, and Test Length as Independent Variables
with Repeated Measures over Sessions and Test Length

Source                                   SS      df     MS       F       p
Semester                               12.69      1   12.69    8.33    .005
Test                                   23.49      1   23.49   15.43    .000
Semester x Test                          .27      1     .27     .18    .675
Error                                 100.50     66    1.52
Session                                 1.01      1    1.01    7.50    .008
Session x Semester                       .00      1     .00     .00    .992
Session x Test                           .02      1     .02     .15    .699
Session x Semester x Test                .13      1     .13     .97    .328
Error                                   8.92     66     .14
Length                                   .04      2     .02     .61    .546
Length x Semester                        .01      2     .00     .11    .895
Length x Test                            .37      2     .19    6.39    .002
Length x Semester x Test                 .04      2     .02     .72    .488
Error                                   3.83    132     .03
Session x Length                         .06      2     .03    1.63    .199
Session x Length x Semester              .01      2     .01     .32    .728
Session x Length x Test                  .02      2     .01     .46    .630
Session x Length x Semester x Test       .06      2     .03    1.43    .242
Error                                   2.54    132     .02

The test length main effect was not significant. However, the
interaction of test type with test length was significant (F=6.39, p<.01). In
order to explore this effect Fisher's LSD test was applied to the mean
ability estimates at the different test lengths. When the LSD test was
applied to the maximum likelihood mean ability estimates no significant
differences were found. For the winter semester Bayesian mean ability
estimates, a value of LSD=.073 at α=.05 was obtained. Comparisons of
this value with the differences in means indicated that the 12 item
and 20 item mean ability estimates were significantly different, while
the other pairings, 12 item with 14 item and 14 item with 20 item, were
not significantly different.

The results of these LSD tests are consistent with previously reported
results. The mean ability of the winter group was closer to the mean of
the assumed prior distribution than was the mean ability of the summer
group. As a result, the effect of the low prior may have been overcome
by the 12 item level for the winter semester. The 12 item and 14 item
mean ability estimates were not significantly different, nor were the
12 item and 20 item mean ability estimates. This finding is an anomaly
for which no explanation could be found. Reckase (1974) found that
continuation of a tailored test beyond the optimal test length introduces
bias into the ability estimates. From the convergence plots it appears
that this was the case here. For the summer semester the significance
of the difference between the 12 item and 20 item mean ability estimates
perhaps indicates that the effect of the prior distribution was not
overcome by 12 items, but rather that the ability estimates continued to
increase beyond the twelfth item. The lack of significance of the
difference between the 12 item and 14 item mean ability estimates may just
be an indication that the increase in ability estimates was too gradual
for two items to make a significant difference. This explanation is, of
course, only one possibility. Other reasonable explanations may be found.

The final set of analyses run on the ability estimates involved an
investigation of the interaction of the ability estimation procedures and
item selection procedures. Because the two tailored testing techniques
utilized different item selection procedures, any differences in the
ability estimates obtained from the techniques could have been due to
differences in the ability estimation procedures, differences in the item
selection procedures, or both. In order to determine the source of the
differences in ability estimates, maximum likelihood ability estimates
were obtained using the items selected by the Bayesian tailored testing
procedure, and the Bayesian ability estimates were obtained using the
items selected by the maximum likelihood tailored testing procedure.
These new ability estimates were analyzed with a three-way ANOVA using
ability estimation procedure, item selection procedure, and session as
independent variables, with repeated measures over sessions. The results
of this analysis are reported in Table 16 and the mean ability estimates
obtained are reported in Table 17.

Table 16

Three-Way ANOVA on Recalculated Ability Estimates
Using Item Selection Procedure, Estimation Procedure, and
Session as Independent Variables, with Repeated
Measures over Sessions

Source                                 SS      df     MS       F       p
Selection                            10.20      1   10.20   11.85    .001
Error                                59.40     69     .86
Session                                .80      1     .80    7.24    .009
Session x Selection                    .01      1     .01     .10    .758
Error                                 7.57     69     .11
Estimation                             .17      1     .17   29.10    .000
Estimation x Selection                 .03      1     .03    4.51    .037
Error                                  .41     69     .01
Session x Estimation                   .01      1     .01    5.46    .022
Session x Estimation x Selection       .01      1     .01    5.08    .027
Error                                  .13     69     .00

Table 17

Means and Standard Deviations Associated with the
Three-Way ANOVA on Recalculated Ability Estimates

                                    Ability Estimation Procedure
Item Selection                   Bayesian             Maximum Likelihood
Procedure     Statistic    Session 1  Session 2    Session 1  Session 2
Bayesian      Mean            .954      1.048         .984      1.079
              Std. Dev.       .533       .516         .579       .563
Max. Like.    Mean           1.317      1.413        1.363      1.507
              Std. Dev.       .409       .433         .456       .504

As can be seen in these tables, regardless of which procedure selected
the items, the maximum likelihood ability estimates were significantly
greater than the Bayesian ability estimates (F=29.10, p<.01). It is clear
from these results that at least part of the differences found between
the Bayesian and maximum likelihood tailored test ability estimate
distributions was due to differences in the ability estimation procedures.
However, it should also be noted from Tables 16 and 17 that for both
procedures the ability estimates were higher when based on the maximum
likelihood items than when based on the Bayesian items (F=11.85, p<.01). It
is also clear, then, that part of the differences found between the ability
estimates obtained from the two tailored test procedures was due to the
difference in items selected for administration. One possible
explanation for these differences is that the assumption of a prior
distribution of ability made by the Bayesian procedure imposed a restriction on
the range of ability estimates obtained from that procedure, which in turn
would restrict the range of the b-values of the items selected. The
restriction of the range of b-values would have further limited the range
of ability estimates. Thus, there may have been an interaction of
item selection and ability estimation procedures that, due to an
inappropriately low prior, limited the magnitude of the resulting ability
estimates. This is supported by the finding that the estimation procedure
x item selection procedure interaction was significant (F=4.51, p<.05).
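
The pull of the prior on a Bayesian ability estimate can be illustrated with a small numerical sketch. It is not the estimation procedure used in the study: it simply computes a posterior mean on a grid of ability values under a normal prior and a 3PL likelihood. The prior standard deviation of 1.0 matches the initial standard error in Tables 5 and 7, and the prior means of .150 and 2.0 are taken from the text; the scaling constant D = 1.7, the grid limits, the item parameters, and the response pattern are assumptions.

```python
import math


def p_3pl(theta, a, b, c, D=1.7):
    """3PL probability of a correct response (assumed logistic form, D = 1.7)."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def posterior_mean(responses, items, prior_mean, prior_sd=1.0,
                   lo=-4.0, hi=6.0, steps=1001):
    """Grid approximation to the posterior mean of ability under a normal prior."""
    num = den = 0.0
    for k in range(steps):
        theta = lo + (hi - lo) * k / (steps - 1)
        # Unnormalized normal prior density at this grid point.
        weight = math.exp(-0.5 * ((theta - prior_mean) / prior_sd) ** 2)
        # Multiply in the likelihood of each scored response.
        for u, (a, b, c) in zip(responses, items):
            p = p_3pl(theta, a, b, c)
            weight *= p if u == 1 else 1.0 - p
        num += theta * weight
        den += weight
    return num / den


# Hypothetical (a, b, c) items and scored responses for one examinee.
items = [(1.0, 0.3, 0.2), (1.2, 0.7, 0.2), (0.9, 1.0, 0.25), (1.1, 1.2, 0.2)]
responses = [1, 1, 0, 1]
low_prior = posterior_mean(responses, items, prior_mean=0.15)
high_prior = posterior_mean(responses, items, prior_mean=2.0)
```

With the same items and responses, the estimate under the higher prior mean is the larger of the two, and the gap narrows as more items are administered. This is the regression toward the prior mean that the discussion above attributes to an inappropriately low prior.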

The session main effect reported in Table 16 was also significant
(F=7.24, p<.01), as was the session x estimation procedure interaction
(F=5.46, p<.05). The significance of the session main effect was
consistent with previously reported findings. The significance of the
session x estimation procedure interaction was probably due to the
restriction in the range of the Bayesian ability estimates. The three-way
interaction among session, estimation procedure, and item selection
procedure (F=5.08, p<.05) is difficult to interpret.

In order to further investigate the effect of the prior distribution
on the obtained Bayesian ability estimates an additional analysis was
performed. This analysis involved obtaining Bayesian ability estimates
using both sets of tailored test items, but using a prior with a mean
of 2.0, as opposed to the mean of -.100 or .150 originally employed.
These new ability estimates were also analyzed using a three-way ANOVA,
but this time with prior distribution in place of ability estimation
procedure as the third independent variable. The results of this analysis
are reported in Table 18 and the obtained means are reported in Table
19. As can be seen in Tables 18 and 19, the new Bayesian ability
estimates were significantly higher using the high prior than when using
the low prior for both item selection procedures. The prior distribution
main effect had an F=91.84, p<.01, while the prior x selection
interaction was not significant. As can be seen from the means reported
in Table 19, use of the high prior increased the Bayesian ability
estimates using the Bayesian tailored test items, but not to the level of
the Bayesian ability estimates using the maximum likelihood tailored test
items. This is supported by the significance of the selection main
effect (F=10.04, p<.01). It should be remembered that for the Bayesian
tests using the high prior the items were still those selected when
the low prior was being used. As a result, all of the items were too
easy for the ability estimates obtained using a high prior. Thus, when
an item was correctly answered the ability estimate would have increased
only minimally. When an item was answered incorrectly, on the other
hand, the low b-values would have resulted in a large decrease in the
ability estimate. That is, the b-values pulled the ability estimates
down close to the level for which the items had been selected. The
effect of the high prior, then, was to increase the ability estimates
by only a small, though statistically significant, amount.
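The pull of the prior described above can be illustrated with a small numerical sketch. It is not the estimation routine used in the study: it assumes a grid-based posterior mean under the 3PL model with a normal prior, a scaling constant D = 1.7, a prior standard deviation of 1, and entirely hypothetical item parameters and responses.

```python
import math

D = 1.7  # logistic scaling constant; an assumption of this sketch


def p_3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL model."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def posterior_mean(responses, items, prior_mean, prior_sd=1.0):
    """Posterior mean of ability on a grid from -4 to 4 with a normal prior."""
    grid = [-4.0 + 0.01 * k for k in range(801)]
    weighted_sum = 0.0
    total_weight = 0.0
    for theta in grid:
        # Unnormalized normal prior density at this grid point.
        weight = math.exp(-0.5 * ((theta - prior_mean) / prior_sd) ** 2)
        # Multiply in the likelihood of each scored response.
        for u, (a, b, c) in zip(responses, items):
            p = p_3pl(theta, a, b, c)
            weight *= p if u == 1 else 1.0 - p
        weighted_sum += theta * weight
        total_weight += weight
    return weighted_sum / total_weight


# Hypothetical, relatively easy items (a, b, c) and a hypothetical response pattern.
items = [(1.7, 0.5, 0.2), (1.7, 0.6, 0.2), (1.7, 0.7, 0.2),
         (1.7, 0.8, 0.2), (1.7, 0.9, 0.2)] * 2
responses = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]

low_prior_estimate = posterior_mean(responses, items, prior_mean=0.0)
high_prior_estimate = posterior_mean(responses, items, prior_mean=2.0)
print(low_prior_estimate, high_prior_estimate)
```

With the items and responses held fixed, raising the prior mean raises the estimate, but the same likelihood, built from the same easy items, limits how far the estimate can move.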


Table 18

Three-Way ANOVA on Recalculated Bayesian Ability Estimates Using
Item Selection Procedure, Prior Distribution, and Session as
Independent Variables, with Repeated Measures over Sessions

Source                            SS     df      MS       F       p

Selection                        9.34     1     9.34    10.04   .002
Error                           64.22    69      .93
Session                           .15     1      .15     3.78   .056
Session x Selection               .13     1      .13     3.18   .079
Error                            2.71    69      .04
Prior                             .38     1      .35    91.84   .000
Prior x Selection                 .00     1      .00      .43   .513
Error                             .28    69      .00
Session x Prior                   .00     1      .00      .49   .484
Session x Prior x Selection       .00     1      .00      .03   .858
Error                             .48    69      .01


Table 19

Means and Standard Deviations Associated with the Three-Way
ANOVA on Recalculated Bayesian Ability Estimates

Item                                    Prior Distribution
Selection                        Low                      High
Procedure     Statistic   Session 1  Session 2    Session 1  Session 2

Bayesian      Mean           .954      1.037        1.017      1.111
              Std. Dev.      .533       .521         .546       .582

Max. Like.    Mean          1.360      1.355        1.430      1.443
              Std. Dev.      .408       .469         .454       .485

This result serves to point out the serious effect of an inappropriate
prior. Selecting a prior and selecting items on the basis of that
prior forces item b-values to remain in the region of the prior mean.
Because only items with b-values in that region are administered,
subsequent ability estimates are also forced to remain in the region of
the prior mean. The results of the ANOVA and LSD tests on the ability
estimates indicate that an inappropriate prior may eventually be overcome,
but it may significantly increase the length of the tailored test that
is required. Moreover, the different results of the LSD test for the
two semesters point out that the appropriateness of the prior must be
determined for every distinct group of examinees.

It may be true that the Bayesian ability estimates obtained using
the high prior were still smaller than the maximum likelihood ability
estimates because the estimates were obtained using items with
inappropriately low b-values. If so, the fact that the Bayesian ability
estimates obtained using the maximum likelihood items were higher than
when the Bayesian items were used would indicate that the maximum
likelihood procedure administered items with higher b-values than did the
Bayesian procedure. A comparison of the mean b-value for the items
administered by the Bayesian procedure (.678) with the maximum likelihood
mean b-value (.903) yielded a t=11.45, p<.001. Clearly, then, the
maximum likelihood procedure administered items with greater b-values than
did the Bayesian procedure. This result supports the hypotheses set
out above.

Items  Administered 


As was discussed in the previous section, the maximum likelihood
procedure tended to administer items with higher b-values than did the
Bayesian procedure. Further comparisons indicated that the mean a-value
of the items administered by the maximum likelihood procedure (1.766)
was significantly greater than the Bayesian mean a-value (1.749),
yielding a t=2.183, p<.05. The mean c-values were not significantly
different.


It was also found that there were operational differences in the
two item selection procedures. The b-value of an item selected by the
Bayesian tailored test procedure was more closely related to the current
ability estimate than was the b-value of an item selected by the maximum
likelihood procedure. The correlation of the item b-values with the
ability estimates used to select the items was r=.77 for the Bayesian
procedure and r=.61 for the maximum likelihood procedure. A comparison
of these correlations using Fisher's r to z transformation resulted in
a z=3.31, p<.01. There were no significant differences in the
correlations of the a-values and c-values with the ability estimates.
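The comparison of two correlations by Fisher's r to z transformation can be sketched as follows. This is a generic textbook version for two independent samples, not a reconstruction of the report's computation: the numbers of observations behind each correlation are not given in this section, so the sample sizes below are hypothetical placeholders.

```python
import math


def fisher_z(r):
    """Fisher's r to z transformation, z = 0.5 * ln((1 + r) / (1 - r))."""
    return math.atanh(r)


def compare_correlations(r1, n1, r2, n2):
    """z statistic for the difference between two independent correlations."""
    standard_error = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    return (fisher_z(r1) - fisher_z(r2)) / standard_error


# Correlations from the text; the sample sizes are hypothetical.
z_statistic = compare_correlations(0.77, 200, 0.61, 200)
print(z_statistic)
```

The statistic grows with the sample sizes, so the value printed here should not be expected to match the reported z=3.31 unless the actual numbers of observations are supplied.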

In  order  to  determine  whether  the  difference  in  entry  point  into 
the  item  pool  between  sessions  affected  one  procedure  more  than  another, 
an  analysis  was  performed  to  discover  whether  one  procedure  had  more  items 
in  common  between  sessions  than  the  other  procedure.  This  analysis  was 
also  used  to  compare  the  similarity  of  items  over  sessions  of  the  two 
procedures. 

For the Bayesian tailored tests the proportion of items administered
in the first session that were repeated in the second session was p=.827.
For the maximum likelihood procedure the proportion of repeated items
was p=.848. A comparison of these two proportions to determine whether
the difference was significant yielded a z=1.07, which was not
significant. It is seen from this result that the two procedures were
equally consistent in the items that were administered across sessions.
Also, both procedures tended to use only about a third of the items in
the item pool.
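A test of the difference between two proportions of this kind is commonly carried out with a pooled z statistic. The sketch below shows that standard form only; the report does not state in this section which formula was used or how many item administrations each proportion is based on, so the counts are hypothetical.

```python
import math


def two_proportion_z(p1, n1, p2, n2):
    """Pooled z statistic for the difference between two independent proportions."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    standard_error = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    return (p1 - p2) / standard_error


# Proportions from the text; the counts of 700 are hypothetical placeholders.
z_statistic = two_proportion_z(0.848, 700, 0.827, 700)
print(z_statistic)
```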

Goodness  of  Fit 


From the analyses reported previously the conclusion was reached
that the Bayesian procedure was producing ability estimates that were
perhaps too low. If that were true then the probability of a correct
response to a given item computed from the 3PL model using those ability
estimates would also be too low. This would be reflected in poorer fit
of the model to the data when using the Bayesian procedure, which should
have been detected by the comparison of the MSD statistic obtained for
the two procedures. This was the case. The MSD statistic obtained for
the maximum likelihood procedure was MSD=.244, and the value obtained for
the Bayesian procedure was MSD=.266. A comparison of these two values
yielded a t=5.64, p<.01, indicating that the Bayesian procedure yielded
significantly poorer fit than the maximum likelihood procedure.
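The reasoning above can be made concrete with a small sketch. It assumes that MSD denotes the mean squared deviation between scored responses (0 or 1) and the probabilities predicted by the 3PL model; the report's exact definition lies outside this section, so that reading, the scaling constant D = 1.7, and the item parameters are all assumptions.

```python
import math

D = 1.7  # logistic scaling constant; an assumption of this sketch


def p_3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL model."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def mean_squared_deviation(theta, responses, items):
    """Mean of (u - P(theta))^2 over the administered items (assumed definition)."""
    total = 0.0
    for u, (a, b, c) in zip(responses, items):
        total += (u - p_3pl(theta, a, b, c)) ** 2
    return total / len(items)


# Hypothetical items (a, b, c), all answered correctly.
items = [(1.7, 0.5, 0.2), (1.7, 0.7, 0.2), (1.7, 0.9, 0.2), (1.7, 1.1, 0.2)]
responses = [1, 1, 1, 1]

msd_lower_estimate = mean_squared_deviation(0.9, responses, items)
msd_higher_estimate = mean_squared_deviation(1.4, responses, items)
print(msd_lower_estimate, msd_higher_estimate)
```

For correctly answered items a lower ability estimate gives lower predicted probabilities and therefore larger squared deviations. For incorrect responses the effect runs the other way, so the net effect on a real test depends on the response pattern.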

Descriptive  Statistics 

The first descriptive statistic compiled for the procedures was
the average test difficulty measured as the proportion of items answered
correctly. To analyze these proportion correct values a three-way ANOVA
was run using semester, test, and session as independent variables, with
repeated measures over sessions. In order to meet the assumption of
normality made by the ANOVA, the proportion correct values were first
transformed using the arc sine transformation. The results of this ANOVA
are summarized in Table 20. As can be seen in the table, both the
semester and test main effects were significant (F=6.55, p<.05 for
semester; F=5.53, p<.05 for test). The means and standard deviations for
this ANOVA are reported in Table 21. An examination of Table 21 shows that
the summer tests were easier for the examinees than were the winter tests.
The nonsignificance of the semester x test interaction indicates that
this was true for both procedures. The session main effect was not
significant, indicating that neither session was easier than the other for
either procedure or semester. However, these tables do indicate that
the maximum likelihood tests were easier for the examinees than were the
Bayesian tests. This finding appears to be contrary to the expected
result. If the Bayesian procedure were administering items with
inappropriately low b-values, as was previously hypothesized, the Bayesian
tests would have been easier for the examinees than were the maximum
likelihood tests. The finding that the maximum likelihood tests were easier
indicates that at least some part of the difference in ability estimates
obtained from the two procedures was due to actual differences in
vocabulary ability. However, the ability estimate analysis investigating
the effect of the prior on subsequent ability estimation and item selection
clearly demonstrates that not all of the differences in ability
estimates were due to differences in group ability.
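The arc sine transformation mentioned above can be sketched as follows. The report does not give the formula in this section; the form used here, the arc sine of the square root of the proportion expressed in degrees, is an inference from the parenthesized values in Table 21 (a proportion of .69 maps to about 56.2, close to the tabled 56.24).

```python
import math


def arcsine_transform(proportion):
    """Return arcsin(sqrt(p)) in degrees for a proportion p between 0 and 1."""
    return math.degrees(math.asin(math.sqrt(proportion)))


# Proportion correct values taken from the means in Table 21.
for p in (0.69, 0.75, 0.79):
    print(p, round(arcsine_transform(p), 2))
```

The tabled values are presumably means of individually transformed proportions, so transforming a mean proportion reproduces them only approximately.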

An analysis of the test difficulty separately for each semester is
revealing. It was hypothesized previously that the Bayesian procedure
actually overcame the inappropriate prior for the winter semester
examinees, and that the ability estimates leveled off somewhat at a level
substantially below the level of the maximum likelihood ability estimates.
This indicates an actual difference in ability. This is supported by
a comparison of the mean test difficulties for the two procedures for
the winter semester, which yields a t=2.47, p<.05. For the summer
semester it was suggested that the inappropriate prior may not have
been overcome. The summer Bayesian ability estimates were significantly
lower than the maximum likelihood ability estimates, but were increasing
with increased test length. Had the Bayesian tailored tests been
sufficiently long to overcome the effect of the prior it is possible that
the Bayesian ability estimates would have approached the level of the
maximum likelihood ability estimates. It is likely, then, that there
was considerably less difference between the group vocabulary abilities
for the summer semester than for the winter semester. This is supported
by the finding that the difference in mean proportion correct values for
the two procedures was not significant for the summer semester. Had
the examinees taking the Bayesian tests been of the same ability as the
maximum likelihood examinees, the Bayesian tailored tests would have been
easier than the maximum likelihood tests.


Table 20

Three-Way ANOVA on Test Difficulties Using Semester,
Test, and Session as Independent Variables, with
Repeated Measures over Sessions

Source                            SS      df       MS       F       p

Semester                        440.14     1     440.14    7.10   .001
Test                            300.08     1     300.08    4.84   .031
Semester x Test                  35.72     1      35.72     .58   .450
Error                          4089.04    66      61.96
Session                           3.80     1       3.80     .66   .419
Session x Semester                3.57     1       3.57     .62   .433
Session x Test                    2.21     1       2.21     .39   .537
Session x Semester x Test         5.31     1       5.31     .93   .339
Error                           378.29    66       5.73


Table 21

Means and Standard Deviations of Test Difficulties for
Both Sessions of the Bayesian and Maximum
Likelihood Tailored Tests for Both Semesters

                              Bayesian                Maximum Likelihood
Semester   Statistic    Session 1    Session 2     Session 1    Session 2

Winter     Mean         .69(56.24)   .70(56.76)    .75(60.93)   .76(60.93)
           Std. Dev.    .09( 5.59)   .07( 4.67)    .05( 3.60)   .07( 4.35)

Summer     Mean         .75(60.83)   .77(61.49)    .79(63.44)   .78(62.79)
           Std. Dev.    .10( 7.46)   .09( 6.55)    .08( 6.11)   .10( 7.15)

Note: Values in parentheses represent the results of the arc sine
transformations.

Another statistic compiled for the two procedures was the mean
testing time for the 20-item test, measured in seconds. The means and
standard deviations for both sessions of the two procedures for both
semesters are shown in Table 22. Table 23 summarizes the results of a
three-way ANOVA on the testing times using semester, test, and session
as independent variables, with repeated measures over sessions. As
can be seen in these tables, the tailored tests in the summer study
took significantly longer than did the winter semester tailored tests
(F=11.71, p<.01). This was true for both procedures, since the semester
x test interaction was not significant. The test main effect was not
significant, indicating that there were no significant differences in
the amount of time the two types of tailored tests lasted. The session
main effect was significant (F=5.96, p<.05), with the second session
tests ending more quickly than the first session tests in all cases
except the summer maximum likelihood condition. The session x semester
interaction was significant (F=5.97, p<.05), with the difference between
the two sessions being larger for the winter semester.

Table 22

Means and Standard Deviations of Testing Time in Seconds for
Both Sessions of the Bayesian and Maximum
Likelihood Tailored Tests for Both Semesters

                              Bayesian                Maximum Likelihood
Semester   Statistic    Session 1    Session 2     Session 1    Session 2

Winter     Mean           607.88       498.12        566.72       480.78
           Std. Dev.      176.02       102.69         94.02       106.13

Summer     Mean           715.92       707.62        601.39       609.78
           Std. Dev.      256.59       222.57        154.37       181.94

Table 23

Three-Way ANOVA on Testing Time Using Semester,
Test, and Session as Independent Variables, with
Repeated Measures over Sessions

Source                             SS       df        MS        F       p

Semester                       483310.04     1    483310.04   11.71   .001
Test                           152578.77     1    152578.77    3.70   .059
Semester x Test                 50360.16     1     50360.16    1.22   .273
Error                         2723098.41    66     41259.07
Session                         79322.81     1     79322.81    5.96   .017
Session x Semester              79459.27     1     79459.27    5.97   .017
Session x Test                   3631.40     1      3631.40     .27   .603
Session x Semester x Test         141.60     1       141.60     .01   .918
Error                          878721.87    66     13313.97
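As a reading aid for Table 23, the F ratios can be recomputed from the tabled sums of squares and degrees of freedom. The sketch below assumes the usual repeated measures layout, in which the between-subjects effects (semester, test) are tested against the first error term and the effects involving session against the second; the function and variable names are illustrative only.

```python
def mean_square(sum_of_squares, df):
    """Mean square: sum of squares divided by its degrees of freedom."""
    return sum_of_squares / df


def f_ratio(ms_effect, ms_error):
    """F statistic: effect mean square over the matching error mean square."""
    return ms_effect / ms_error


# Error terms from Table 23.
between_error_ms = mean_square(2723098.41, 66)
within_error_ms = mean_square(878721.87, 66)

f_semester = f_ratio(mean_square(483310.04, 1), between_error_ms)
f_test = f_ratio(mean_square(152578.77, 1), between_error_ms)
f_session = f_ratio(mean_square(79322.81, 1), within_error_ms)
f_session_by_semester = f_ratio(mean_square(79459.27, 1), within_error_ms)

print(round(f_semester, 2), round(f_test, 2),
      round(f_session, 2), round(f_session_by_semester, 2))
```

The recomputed ratios agree with the tabled F values to two decimal places.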

Latency 


The correlations obtained between the mean item response latencies
for a person and their ability estimates for both sessions of both
semesters are shown in Table 24. As can be seen in the table, none of the
correlations were significant for the winter semester. For the summer
semester the correlations were significant for the first session Bayesian
tests (r=-.57, p<.05) and for both sessions of the maximum likelihood
tests (r=-.51, p<.05 for the first session; r=-.47, p<.05 for the second
session).


Table 24

Correlations of Ability Estimates and Mean Latencies for
Both Sessions of the Bayesian and Maximum Likelihood
Tailored Tests for the Winter and Summer Semesters

                    Bayesian                 Maximum Likelihood
Semester     Session 1    Session 2       Session 1    Session 2

Winter         -.17      [illegible]         .06         -.18
Summer         -.57*        -.04            -.51*        -.47*

* p<.05.


The final analysis performed was the comparison of mean latencies for
correct and incorrect responses. The results of a four-way ANOVA on the
mean latencies are summarized in Table 25. The means and standard
deviations for this analysis appear in Table 26. For this analysis the
independent variables were semester, test, session, and response (correct
or incorrect). Session and response were repeated measures. As can be seen
in Table 25, the session and response main effects were significant. The
semester and test main effects were not significant, nor were any of the
interactions. From Table 26 it can be seen that the first session response
latencies were greater than the second session response latencies. Also,
response latencies for the incorrect responses were greater than the
latencies for correct responses. No differences were found between the two
test procedures.


Table 25

Results of Four-Way ANOVA on Mean Response Latencies
Using Semester, Test, Session, and Response as
Independent Variables with Repeated Measures
over Session and Response

Source                                      SS      df       MS        F       p

Semester                                  527.87     1     527.87     3.35   .072
Test                                      315.29     1     315.29     2.00   .162
Semester x Test                            75.61     1      75.61      .46   .474
Error                                   10570.06    67     157.76
Session                                  2000.51     1    2000.51    52.42   .000
Session x Semester                          4.42     1       4.42      .12   .735
Session x Test                             42.52     1      42.52     1.11   .295
Session x Semester x Test                  17.61     1      17.61      .46   .499
Error                                    2556.81    67      38.16
Response                                 2794.22     1    2794.22    60.72   .000
Response x Semester                        45.40     1      45.40      .99   .324
Response x Test                            12.99     1      12.99      .28   .597
Response x Semester x Test                  3.26     1       3.26      .07   .791
Error                                    3083.28    67      46.02
Session x Response                         93.99     1      93.99     3.54   .064
Session x Response x Semester               4.34     1       4.34      .16   .687
Session x Response x Test                   3.83     1       3.83      .14   .705
Session x Response x Semester x Test        3.15     1       3.15      .12   .732
Error                                    1777.85    67      26.54


Table 26

Means and Standard Deviations of Response Latencies for Correct and
Incorrect Responses for Both Sessions of the Bayesian and Maximum
Likelihood Tailored Tests for Both Semesters

                                         Bayesian             Maximum Likelihood
Semester   Response    Statistic    Session 1  Session 2    Session 1  Session 2

Winter     Correct     Mean           13.86       8.86        11.80       8.27
                       Std. Dev.       7.32       4.65         4.15       2.71
           Incorrect   Mean           20.23      13.27        19.51      12.22
                       Std. Dev.      11.79       6.10        11.02       5.75

Summer     Correct     Mean           17.40      11.82        12.20       9.27
                       Std. Dev.       6.36       4.32         4.99       5.92
           Incorrect   Mean           24.89      17.50        21.05      16.23
                       Std. Dev.      13.45       8.28        12.72       9.91


Discussion


In order to put this study in the proper perspective it is necessary to
view it as one of a series of studies designed to evaluate alternative
components for tailored testing. The series began with several studies
designed to determine which of the available latent trait models was optimal
(Koch and Reckase, 1978, 1979; McKinley and Reckase, 1980a; Reckase, 1977).
Once a model was selected (the 3PL model), a study was done to identify the
optimal item calibration procedure to be used with the model (McKinley and
Reckase, 1980b). After an item calibration procedure was selected for the
model (LOGIST), a set of studies was begun to determine what the optimal
operating characteristics of the tailored testing procedure should be.
These characteristics included item selection and ability estimation
procedures, which are the topics of the current study.

Optimal  Cutoffs 

The convergence plot analyses performed indicated that the optimal test
length for the Bayesian procedure was 14 items. This result was consistent
across sessions and semesters. For the maximum likelihood procedure the
optimal test length was 12 items. If cutoffs are expressed in terms of item
information and standard error of estimate, the optimal cutoff value for
the Bayesian procedure was a standard error of estimate of .25, and for the
maximum likelihood procedure the optimal cutoff value was an item
information of 1.64. A comparison of the optimal test lengths for the two
procedures indicates that the Bayesian procedure requires more items to
obtain stable ability estimates. This conclusion is supported by the finding
that the test length main effect was not significant for the maximum
likelihood ability estimates but was significant for the Bayesian ability
estimates. This finding was also consistent across semesters.

Reliabilities 


In terms of reliability, no significant differences were found between
the two procedures. Moreover, for neither procedure were there any
significant differences in reliability across the different test lengths.
The results were the same when reliabilities were computed using estimated
true scores. It should be pointed out here that the relative instability of
the Bayesian ability estimates did not lower the reliability of the Bayesian
tailored tests at the shorter test lengths. As was stated earlier, these
reliabilities were obtained with relatively small sample sizes, so large
differences were necessary for significance.

Total  Test  Information 

At the 20-item level the mean total test information yielded by the
Bayesian procedure was significantly greater than the mean total test
information yielded by the maximum likelihood procedure. It is apparent from
these findings that the Bayesian procedure was yielding ability estimates in
a range where more items with high information at those ability estimates
were available. Since the mean of the assumed prior distribution of ability
was in that region of the ability scale for which the item pool would yield
the greatest information, this result indicates that the ability estimates
yielded by the Bayesian procedure tended to be relatively close to the mean
of the prior distribution. In that region of the ability scale where
both procedures yielded ability estimates the Bayesian procedure did not
appear to yield more total test information. It should be pointed out
that the prior of the Bayesian procedure held ability estimates in that
range where there was high information because the mean of the prior was
selected as an ability near the mode of the total information curve of
the pool. Had the prior been set higher the mean total test information
for the Bayesian procedure would have decreased. Thus, the high
information of the Bayesian procedure was due to the selection of the prior
and the structure of the item pool.
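The dependence of total test information on where the ability estimate falls can be sketched with the standard 3PL item information function, I(theta) = D^2 a^2 (Q/P) ((P - c)/(1 - c))^2, summed over the items administered. The scaling constant D = 1.7 and the item parameters below are assumptions for illustration, not values from the study's item pool.

```python
import math

D = 1.7  # logistic scaling constant; an assumption of this sketch


def p_3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL model."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def item_information(theta, a, b, c):
    """3PL item information at ability theta."""
    p = p_3pl(theta, a, b, c)
    q = 1.0 - p
    return (D * a) ** 2 * (q / p) * ((p - c) / (1.0 - c)) ** 2


def total_information(theta, items):
    """Total test information: the sum of the item informations."""
    return sum(item_information(theta, a, b, c) for a, b, c in items)


# Hypothetical items (a, b, c) with difficulties clustered near 0.7.
items = [(1.7, 0.5, 0.2), (1.7, 0.7, 0.2), (1.7, 0.9, 0.2)]

info_near_items = total_information(0.7, items)
info_far_from_items = total_information(3.0, items)
print(info_near_items, info_far_from_items)
```

Information is high when the ability estimate lies near the b-values of the administered items and falls off sharply away from them, which is the sense in which an estimate held near the densest part of the pool yields more total test information.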

Ability  Estimates 

The four-way ANOVA on the ability estimates confirmed the hypothesis
that the maximum likelihood ability estimates were significantly greater
than the Bayesian ability estimates. On the basis of this finding the
hypothesis was formulated that the Bayesian ability estimates were smaller
because the mean of the assumed prior distribution was too low. That is,
the effect of the prior distribution was to lower the ability estimates
obtained from the Bayesian procedure. If these hypotheses were true, then
the Bayesian ability estimates should have increased as test length
increased, since additional items would give the procedure opportunity to
overcome the effect of the inappropriate prior. The increase should have
continued until the prior was overcome, and then the ability estimates
should have begun to stabilize. Evidence supporting this prediction was
obtained from the test length analyses. The maximum likelihood ability
estimates did not change significantly after 12 items, while for the
Bayesian ability estimates the test length effect was significant. For
the summer semester the Bayesian mean ability estimates continued to
increase across the different test lengths. This is an indication that the
mean ability of the summer examinees was sufficiently higher than the mean
of the assumed prior distribution of ability that the retarding effect
of the prior was never completely overcome, even after 20 items. The mean
ability of the winter semester examinees was lower than the mean ability
of the summer group, and as a result the prior was more appropriate,
yielding stable estimates by the twelfth item. From these results it appears
that use of an inappropriate prior distribution of ability may have a serious
effect on the ability estimates obtained from the Bayesian procedure, thus
affecting the length of test required to obtain accurate estimates.

The investigation into the interaction of the ability estimation
procedures and the item selection procedures yielded further evidence as to
the restricting effect of the assumed prior distribution. The Bayesian
ability estimation procedure consistently yielded ability estimates that
were lower (closer to the mean of the assumed prior distribution) than
the maximum likelihood ability estimates, even when ability estimates were
obtained from the two procedures using the same set of items. When ability
estimates were obtained from the Bayesian procedure on the maximum
likelihood items using a higher prior mean, the ability estimates increased
to the same level as the maximum likelihood ability estimates. However,
using a higher prior did not significantly increase the Bayesian ability
estimates on the Bayesian items. Because the maximum likelihood items had
significantly higher b-values, it was hypothesized that raising the prior
would affect ability estimates only if items were selected on the basis
of the new prior (i.e., with higher b-values). These results indicate that,
had a higher prior distribution of ability been assumed for the summer
Bayesian tailored tests, the procedure would have administered items
with greater b-values and would have yielded ability estimates close to
the magnitude of the maximum likelihood ability estimates. For the
winter semester, due to the actual differences in ability between the
two groups, a higher prior probably would have significantly increased
the Bayesian ability estimates, but not to the level of the maximum
likelihood ability estimates.

The results of the ability estimate analyses lead to two general
conclusions. The first conclusion is that use of an inappropriate prior
distribution of ability in the estimation of ability may significantly
increase the test length required to obtain accurate ability estimates. The
greater the degree of inappropriateness of the assumed prior, the longer
the tailored test will have to be to obtain good ability estimates. The
second conclusion is that the commonly assumed prior distribution of ability
will not be appropriate for a heterogeneous group. The same prior was used
for the winter and summer examinees, two groups clearly different in ability.
For the winter semester the effect of the inappropriate prior was not
as pronounced as it was for the summer session.

These conclusions have special significance for criterion-referenced
testing, where some absolute level of performance is sought. An
inappropriate prior could prevent an examinee's ability estimate from
reaching the criterion, or could artificially elevate the ability estimates
to a level above the criterion. Making valid decisions in such situations
would be quite difficult.

Items  Administered 


Analysis of the items administered by the two procedures indicated
that the maximum likelihood procedure administered items with higher
b-values than the Bayesian procedure. There appeared to be two reasons for
this. First, the Bayesian ability estimates were lower than the maximum
likelihood ability estimates, and therefore the administration of items
appropriate for the current ability estimate resulted in the selection of
easier items. Second, the item selection procedure used by the Bayesian
tailored test procedure selected items with b-values more highly correlated
with the ability estimates than did the item selection procedure employed
by the maximum likelihood tailored test procedure. The result of this was
to strengthen the effect of the lower ability estimates yielded by the
Bayesian procedure. The effect of the inappropriate prior might have been
less had the Bayesian procedure selected items on the basis of information.

Further  comparisons  indicated  that  the  maximum  likelihood  procedure
administered  items  with  higher  a-values  than  the  Bayesian  procedure.
This  was  probably  due  to  the  fact  that  selection  using  the  information
function  weighted  the  a-value  more  heavily  in  the  selection  of  items
than  did  the  Bayesian  procedure.  No  differences  were  found  in  the
c-values  of  the  items  administered  by  the  two  procedures.
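
The contrast between selecting on information and selecting on difficulty alone can be sketched as follows. The sketch assumes the standard 3PL item information function with D = 1.7 and a hypothetical three-item pool. The difficulty-matching rule is only a stand-in for a selection rule that tracks b-values closely; it is not the posterior variance criterion used by the Bayesian procedure in this study.

```python
import math

D = 1.7  # conventional logistic scaling constant (assumed here)

def p_3pl(theta, a, b, c):
    """3PL probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def item_information(theta, a, b, c):
    """3PL item information at ability theta."""
    p = p_3pl(theta, a, b, c)
    return (D * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

def pick_by_information(theta, pool):
    """Index of the item with maximum information at theta."""
    return max(range(len(pool)), key=lambda i: item_information(theta, *pool[i]))

def pick_by_difficulty(theta, pool):
    """Index of the item whose b-value is closest to theta (illustrative stand-in)."""
    return min(range(len(pool)), key=lambda i: abs(pool[i][1] - theta))

# Hypothetical pool of (a, b, c) parameters
pool = [(0.6, 0.0, 0.2), (1.8, 0.4, 0.2), (1.0, -1.0, 0.2)]

theta_hat = 0.0
info_choice = pick_by_information(theta_hat, pool)
diff_choice = pick_by_difficulty(theta_hat, pool)
print(info_choice, diff_choice)
```

In this hypothetical pool the information rule prefers the highly discriminating item even though its b-value is not the closest to the current estimate, while the difficulty rule picks the closest b-value regardless of discrimination.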

Goodness  of  Fit


The  results  of  the  ability  estimate  analyses  indicated  that  the 
Bayesian  procedure  may  have  underestimated  ability.  As  a  result,  an 
examinee  would  have  had  a  higher  probability  of  correctly  responding 
to  items  than  would  have  been  predicted  by  the  model  on  the  basis  of
their  Bayesian  ability  estimate.  This  would  result  in  poorer  fit  of
the  3PL  model  to  the  data  when  using  Bayesian  ability  estimates  than
when  using  the  maximum  likelihood  ability  estimates.  This  was  found 
to  be  the  case.  The  MSD  value  obtained  for  the  Bayesian  procedure  was 
significantly  greater  than  the  MSD  value  obtained  for  the  maximum 
likelihood  procedure. 
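
The link between an underestimated ability and poorer fit can be sketched numerically. The sketch assumes that MSD denotes the mean squared deviation between the observed 0/1 responses and the probabilities predicted by the 3PL model at the ability estimate; the exact computational formula is not restated in this section, so that definition, the D = 1.7 constant, and the item parameters and responses below are all assumptions.

```python
import math

D = 1.7  # conventional logistic scaling constant (assumed here)

def p_3pl(theta, a, b, c):
    """3PL probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def msd(responses, theta_hat, items):
    """Mean squared deviation of 0/1 responses from model-predicted probabilities."""
    squared = [(u - p_3pl(theta_hat, a, b, c)) ** 2
               for u, (a, b, c) in zip(responses, items)]
    return sum(squared) / len(squared)

# Hypothetical items (a, b, c) and responses for one examinee
items = [(1.0, -1.0, 0.2), (1.0, -0.5, 0.2), (1.0, 0.0, 0.2), (1.0, 0.5, 0.2)]
responses = [1, 1, 1, 0]

msd_low = msd(responses, -1.0, items)   # estimate pulled too low
msd_high = msd(responses, 0.5, items)   # estimate closer to the response pattern
print(msd_low, msd_high)
```

For this response pattern the lower ability estimate predicts too few correct responses and therefore gives the larger MSD.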

Descriptive  Statistics

The  results  of  the  analyses  of  the  mean  proportion  correct  for  each 
test  at  first  appeared  inconsistent  with  the  results  of  other  analyses.
The  maximum  likelihood  tailored  tests  were  found  to  be  significantly
less  difficult  than  the  Bayesian  tailored  tests.  Since  the  maximum 
likelihood  procedure  administered  items  with  greater  b-values,  it  was 
expected  that  the  maximum  likelihood  tests  would  be  found  to  be  more 
difficult  than  the  Bayesian  tests.  Further  analyses  apparently  resolved
this  conflict.  Since  the  winter  Bayesian  ability  estimates  stabilized 
at  a  lower  level  than  the  maximum  likelihood  ability  estimates,  it  was 
hypothesized  that  the  examinees  taking  the  Bayesian  tests  were  of  lower 
vocabulary  ability  than  the  examinees  taking  the  maximum  likelihood  tests. 
This  would  explain  why  the  examinees  taking  the  Bayesian  tests  received 
easier  items  than  the  examinees  taking  the  maximum  likelihood  tests 
but  missed  more  items.  For  the  summer  semester  the  Bayesian  ability
estimates  did  not  stabilize,  indicating  that  the  examinees  taking  the 
Bayesian  tests  were  of  vocabulary  ability  closer  to  the  ability  of  the 
examinees  taking  the  maximum  likelihood  tests  than  was  the  case  with 
the  winter  group.  In  support  of  this  interpretation  was  the  finding  that 
for  the  summer  semester  the  mean  proportions  correct  for  the  two
procedures  were  not  significantly  different.

The  results  of  these  analyses  leave  unclear  the  degree  to  which  the 
difference  in  ability  estimates  obtained  from  the  two  procedures  was  due
to  actual  differences  in  vocabulary  ability  and  how  much  of  the  difference
was  due  to  differences  in  the  ability  estimation  and  item  selection
procedures.  However,  on  the  basis  of  the  recalculated  ability  estimates
discussed  in  conjunction  with  the  ability  estimate  analyses,  it  would
appear  that  a  substantial  part  of  the  difference  was  due  to  the
interaction  of  the  ability  estimation  procedure,  including  the  prior,  with
the  item  selection  procedure.

The  other  statistic  compiled  for  the  two  procedures  was  mean  testing 
time.  An  ANOVA  on  mean  testing  times  indicated  that  there  was  not  a  sig¬ 
nificant  difference  in  the  amount  of  time  the  two  types  of  tailored  tests 
required  when  the  number  of  items  administered  was  the  same. 


Latency 

The  correlations  obtained  between  mean  latencies  and  ability  esti¬ 
mates  followed  no  meaningful  pattern.  For  the  winter  semester  the  mean 
latencies  were  not  significantly  correlated  with  the  ability  estimates.
That  is,  the  magnitude  of  the  ability  estimates  apparently  had  no  bearing 
on  the  mean  amount  of  time  to  respond.  For  the  summer  semester  the  corre¬ 
lations  were  significant,  but  there  were  no  differences  indicated  between 
the  two  procedures.  The  significant  correlations  for  the  summer  indicated
that  the  brighter  students  took  less  time  to  respond  to  the  items.  No 
hypothesis  could  be  produced  to  explain  why  the  summer  and  winter  groups 
behaved  differently. 

The  final  analysis  performed  was  the  comparison  of  mean  latencies 
for  correct  and  incorrect  responses.  The  ANOVA  on  the  response  latencies 
indicated  that  response  time  was  greater  for  incorrect  responses  than 
for  correct  responses,  but  no  differences  between  the  two  procedures  were 
indicated. 


Nonconvergence 

Nonconvergence  was  not  actually  a  research  question  in  this  study, 
but  whenever  maximum  likelihood  estimation  procedures  are  employed  it  is 
an  important  issue.  Earlier  studies  using  maximum  likelihood  estimation 
in  tailored  test  procedures  (Koch  and  Reckase,  1978,  1979;  McKinley  and 
Reckase,  1980a)  found  nonconvergence  to  be  a  serious  problem  for  the  3PL
model.  The  incidence  of  nonconvergence  was  reduced  by  properly  selecting 
entry  points  into  the  item  pool  and  more  accurately  linking  the  item 
calibrations  used  in  the  tailored  testing  procedure,  but  nonconvergence 
was  not  completely  eliminated.  An  important  observation  concerning  the 
current  study  is  that  there  were  no  cases  of  nonconvergence. 

Summary  and  Conclusions 

Previous  studies  investigating  alternatives  for  the  various  components
of  tailored  testing  indicated  that  the  3PL  model  was  preferred  to  the
1PL  model.  It  was  also  found  that  the  LOGIST  calibration  program  was 
better  than  the  ANCILLES  procedure  for  calibrating  the  item  pool.  Once 
these  components  had  been  selected,  several  studies  were  undertaken  to
determine  the  optimal  operational  characteristics  of  a  tailored  testing 
procedure  using  these  components.  The  present  study  was  designed  to 
compare  alternative  ability  estimation  and  item  selection  procedures. 

This  study  involved  a  live  tailored  testing  comparison  of  a  tailored 
testing  procedure  based  on  a  Bayesian  ability  estimation  procedure  and 
a  tailored  testing  procedure  based  on  maximum  likelihood  ability  estima¬ 
tion.  The  Bayesian  tailored  testing  procedure  selected  items  so  as  to 
minimize  the  posterior  variance  of  the  ability  estimate  distribution, 
while  the  maximum  likelihood  tailored  testing  procedure  selected  items 
so  as  to  maximize  the  item  information  for  the  current  ability  estimate.
Attempts  were  made  to  first  determine  the  optimal  test  length  for  the 
two  procedures,  and  then  to  compare  the  procedures  at  those  test  lengths, 
as  well  as  at  a  20-item  test  length.


Analyses  indicated  that  the  optimal  test  length  of  the  maximum  like¬ 
lihood  procedure  was  about  12  items,  while  the  optimal  length  of  the 
Bayesian  test  was  14  items.  Comparisons  of  the  two  procedures  at  these 
test  lengths  and  at  the  20-item  length  yielded  the  following  results.

There  was  no  difference  at  any  test  length  between  the  two  procedures  in 
terms  of  reliability.  The  Bayesian  procedure  did  yield  greater  mean  total 
test  information  than  did  the  maximum  likelihood  procedure.  However,  it 
was  found  that  the  higher  information  of  the  Bayesian  procedure  was  due 
to  the  regression  of  the  ability  estimates  to  the  mean  of  the  assumed  prior 
distribution  of  ability.  In  the  range  of  ability  where  there  were  ability 
estimates  for  both  procedures  the  difference  in  total  test  information  was 
negligible.  Further  analyses  showed  that  the  assumption  of  different 
priors  can  significantly  alter  the  ability  estimates  obtained  from  a  Baye¬ 
sian  tailored  test,  as  well  as  the  total  test  information  yielded  by  the 
tailored  test.  It  was  found  that  the  more  inappropriate  the  prior  the 
longer  the  Bayesian  tailored  test  had  to  be  to  obtain  accurate  ability 
estimates.  Thus,  the  winter  semester  Bayesian  tests  yielded  stable 
ability  estimates  by  the  twelfth  item,  on  the  average,  while  the  summer 
semester  Bayesian  ability  estimates  generally  did  not  converge  to  a  stable 
value.  This  was  consistent  with  the  finding  that  the  subjects  in  the 
summer  semester  were  of  higher  vocabulary  ability  than  were  the  winter 
semester  subjects.  Analyses  of  the  item  selection  procedures  indicated 
that  selection  of  items  to  minimize  the  posterior  variance  of  the  ability 
estimates  magnified  the  effect  of  the  inappropriate  prior.  The  goodness 
of  fit  comparison  indicated  that  the  Bayesian  procedure  yielded  signifi¬ 
cantly  poorer  fit  of  the  3PL  model  to  the  data  than  did  the  maximum  like¬ 
lihood  procedure,  which  was  consistent  with  the  finding  that  the  Bayesian 
ability  estimates  were  too  low. 

Based  on  the  results  reported  above  it  was  concluded  that  selection 
of  an  inappropriate  prior  significantly  increased  the  test  length  required 
for  accurate  estimation  using  a  Bayesian  tailored  test.  At  any  length 
less  than  the  optimal  test  length,  Bayesian  ability  estimates  are  biased 
in  the  direction  of  the  mean  of  the  prior  distribution.  If  testing  con¬ 
tinues  beyond  the  optimal  test  length,  bias  is  again  introduced  into  the 
ability  estimates  if  inappropriate  items  are  administered.  Because  the 
optimal  test  length  varies  depending  on  the  appropriateness  of  the  prior, 
in  order  to  avoid  bias  in  the  ability  estimates  it  is  essential  to  deter¬ 
mine  an  appropriate  prior.  Also,  it  is  clear  from  this  study  that  the 
N(0,1)  prior  can  be  appropriate  for  only  a  relatively  homogeneous  group. 

For  large  heterogeneous  groups  determination  of  an  appropriate  prior  is 
much  more  difficult,  and  bias  in  the  ability  estimates  can  often  result. 
Therefore,  the  Bayesian  tailored  testing  procedure  seems  appropriate  only
when  good  prior  information  can  be  obtained.  For  large  scale  tailored 
testing  a  maximum  likelihood  tailored  testing  procedure  with  item  selection 
based  on  information  is  the  procedure  of  choice. 